An R package for Keyword Assisted Topic Models, created by Shusei Eshima, Tomoya Sasaki, and Kosuke Imai.
Please visit our website for a complete reference.
An R package for Keyword Assisted Topic Models
Home Page: https://keyatm.github.io/keyATM/
License: GNU General Public License v3.0
"Not compatible with requested type" error when fitting.
sessionInfo() # please run this in R and copy&paste the output
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_2.0.1 keyATM_0.1.0 nvimcom_0.9-83
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 magrittr_1.5 stopwords_1.0 tidyselect_1.0.0
[5] munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38 R6_2.4.1
[9] rlang_0.4.5 fastmap_1.0.1 fastmatch_1.1-0 stringr_1.4.0
[13] dplyr_0.8.4 tools_3.6.1 parallel_3.6.1 grid_3.6.1
[17] data.table_1.12.8 gtable_0.3.0 RcppParallel_5.0.0 assertthat_0.2.1
[21] tibble_2.1.3 lifecycle_0.1.0 crayon_1.3.4 Matrix_1.2-18
[25] purrr_0.3.3 ggplot2_3.3.0 glue_1.3.2 stringi_1.4.6
[29] compiler_3.6.1 pillar_1.4.3 scales_1.1.0 pkgconfig_2.0.3
> out <- keyATM(docs = data$docs,
+ keywords = data$keywords,
+ no_keyword_topics = 0,
+ model = "base",
+ options = list(seed = 250, iterations = 10)
+ )
Initializing the model...
Warning in check_keywords(info$wd_names, keywords, options$prune) :
A keyword will be pruned because it does not appear in documents: appointment
Fitting the model. 10 iterations...
Error in keyATM_fit_base(key_model, iter = options$iterations) :
Not compatible with requested type: [type=NULL; target=integer].
It should run.
May I ask why this line is inside the loop?
It seems to me that this should only be done once, outside the loop, when computing the log of the conditional posterior distribution.
Or is there something I am missing?
It would be great if a progress bar (see this example) were added to track the running time of the keyATM_read() function.
# Example
keyATM_read(progress_bar = TRUE)
A progress bar could help track the remaining time of the keyATM_read() function.
I noticed that it took quite a while for the keyATM_read() function to parse a large document-term matrix object. I'm currently using tictoc::tic() and toc() to document the running time. It would be convenient, however, if a progress_bar option could be provided as one of the arguments of the function.
In the configuration of keyATM, one of the parameters to be specified is weights_type. My understanding is that "information-theory" refers to -log base 2, as presented in your paper. I would like to clarify how you define the computation of inverse frequency. Thank you.
I ran keyATM on just a subset of 5000 docs and I got this error:
> model.keyATM <- keyATM(
docs = keyATM_docs,
no_keyword_topics = NUM_TOPICS,
keywords = KEYWORDS,
model = "cov",
model_settings = list(covariates_data = data.matrix(stm_dfm$meta),
covariates_formula = ~ as.factor(meetingType))
)
Initializing the model...
Fitting the model. 1500 iterations...
0% 10 20 30 40 50 60 70 80 90 100%
[----|----|----|----|----|----|----|----|----|----|
**************************************************|
Creating an output object. It may take time...
Error: Column `Proportion` must be length 40 (the number of rows) or one, not 47
This was an error internal to keyATM() - any idea what this might be about?
keyATM_read() raises an error if there is an empty document. Should it drop the document silently?
My position is that researchers should be explicit about all the modifications to the data. The number of documents that researchers think they use should match with the number actually used.
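In that spirit, here is a minimal sketch of dropping empty documents explicitly before keyATM_read(), so the number of removed documents is on record. With a quanteda dfm you could compute `keep` as `quanteda::ntoken(data_dfm) > 0`; a plain count matrix stands in below.

```r
# Toy document-term count matrix; row 2 is an empty document
dtm <- matrix(c(1, 0, 2,
                0, 0, 0,
                3, 1, 0), nrow = 3, byrow = TRUE)

keep <- rowSums(dtm) > 0                  # empty doc = all zero counts
message(sum(!keep), " empty documents dropped")
dtm <- dtm[keep, , drop = FALSE]          # then pass on to keyATM_read()
```

This keeps the modification explicit: the researcher sees exactly how many documents were removed before fitting.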
It would be nice to monitor the iterations of keyATM, for instance in shiny.
# Example
iter <- 1500
shiny::withProgress(message = "Running keyATM", max = iter, {
  out <- keyATM::keyATM(docs = docs, no_keyword_topics = num_topics,
                        keywords = keyw, model = "base",
                        options = list(iterations = iter,
                                       pb = shiny::setProgress))  # `pb` is the proposed new option
})
I would like to use this great package from within Shiny.
Hello, I haven't found a method to predict the topic of a new document with the base model. If such a method is indeed missing, could you please provide this feature?
Hey,
I hope this is the right channel to address this question:
In the preparation section on the keyATM website, you write: "Researchers can use other methods such as a keyword selection algorithm proposed in King, Lam and Roberts (2017)."
However, the keyATM package does not seem to have a function for this algorithm. Did I overlook something, or was this sentence meant to encourage readers to implement it themselves?
Thanks in advance!
I really enjoyed working with this package - thank you for all the work on it!
What I do miss when working with keyATM is a feature enabling comparison among models with different numbers of topics, based on several existing measures.
In particular, the package LDAtuning by @nikita-moor has been of immense help when working with LDA implementations (https://cran.r-project.org/web/packages/ldatuning/index.html).
I was wondering if any of those measures could be also used with keyATM?
I wanted to check whether some of the functions in LDAtuning could be "adjusted" to work with the keyATM base model.
However, since I do not understand all the nuances of how keyATM works compared to LDA models, I was not even sure if any of these adjustments would be valid.
The measures calculated by the LDAtuning package rely on output of LDA models built with topicmodels. So, by reading through the keyATM documentation, I concluded the following:
I gathered that
LDAmodel@logLiks
would be matched by the following for keyATM models:
keyATMmodel$model_fit$`Log Likelihood`
Other metrics call for the beta probabilities of terms over topics:
LDAmodel@beta
Reading the documentation of keyATM, I concluded this is comparable to phi?
keyATMmodel$phi
And the posterior topic distributions:
LDAmodel@gamma
would correspond to:
keyATMmodel$theta
Would any of the functions relying on the LDA model outputs above work with keyATM as well?
Thank you!
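One caveat worth flagging with this mapping: in topicmodels, LDAmodel@beta stores *log* word probabilities, while keyATM's phi holds probabilities, so the closer analogue is log(phi). A toy sketch (the keyATM slot names follow the post above; the object here is a hand-built stand-in, not a real fit):

```r
# Stand-in for a fitted keyATM object, mimicking the slots referenced above
out <- list(model_fit = data.frame(`Log Likelihood` = c(-100, -90),
                                   check.names = FALSE),
            phi   = matrix(0.5, 2, 2),   # word probabilities per topic
            theta = matrix(0.5, 2, 2))   # doc-topic proportions

log_liks   <- out$model_fit$`Log Likelihood`  # cf. LDAmodel@logLiks
beta_like  <- log(out$phi)                    # cf. LDAmodel@beta (log scale!)
gamma_like <- out$theta                       # cf. LDAmodel@gamma
```

Any metric that consumes @beta on the log scale would need this log() applied before it could be fed keyATM output.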
I successfully fit keyATM with the default single thread mode. I'm interested in going to parallel processing, since I have a 16 core CPU and lots of RAM. The help suggests to use future::plan() but I don't see any further documentation or vignettes or help about this. I have used doParallel() and foreach() before, but I know future is a whole new paradigm.
May I ask for an example to work from, specifically changing the keyATM base example from single-threaded to multithreaded?
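A minimal sketch of what such an example might look like, assuming (per the help page) that keyATM picks up the backend set via future::plan() and that the parallel_init option enables the parallelized step; I have not verified exactly which parts of the fit are parallelized, so treat this as a starting point rather than a definitive recipe:

```r
library(keyATM)
library(future)

plan(multisession, workers = 16)  # set the future backend for a 16-core CPU

out <- keyATM(docs = keyATM_docs,            # prepared with keyATM_read()
              no_keyword_topics = 5,
              keywords = keywords,
              model = "base",
              options = list(seed = 250,
                             parallel_init = TRUE))  # assumed parallelized step

plan(sequential)                  # reset the backend when done
```

Unlike doParallel/foreach, with future you only declare the plan once; code written against the future API then runs on whatever backend is active.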
"incorrect number of dimensions" error if there is only one keyword topic.
> sessionInfo()
R version 3.6.1 (2019-07-05)
Platform: x86_64-apple-darwin18.6.0 (64-bit)
Running under: macOS Mojave 10.14.6
Matrix products: default
BLAS/LAPACK: /usr/local/Cellar/openblas/0.3.6_1/lib/libopenblasp-r0.3.6.dylib
locale:
[1] C/UTF-8/C/C/C/C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] knitr_1.26 quanteda_2.0.1 forcats_0.4.0 stringr_1.4.0
[5] dplyr_0.8.4 purrr_0.3.3 readr_1.3.1 tidyr_1.0.2
[9] tibble_2.1.3 ggplot2_3.3.0 tidyverse_1.2.1 keyATM_0.1.0
[13] nvimcom_0.9-83
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4 pillar_1.4.3 compiler_3.6.1 cellranger_1.1.0
[5] stopwords_1.0 tools_3.6.1 jsonlite_1.6.1 lubridate_1.7.4
[9] lifecycle_0.1.0 nlme_3.1-140 gtable_0.3.0 lattice_0.20-38
[13] pkgconfig_2.0.3 rlang_0.4.5 fastmatch_1.1-0 Matrix_1.2-18
[17] cli_2.0.1 rstudioapi_0.11 parallel_3.6.1 xfun_0.11
[21] haven_2.1.1 fastmap_1.0.1 withr_2.1.2 httr_1.4.1
[25] xml2_1.2.2 generics_0.0.2 vctrs_0.2.2 hms_0.5.2
[29] grid_3.6.1 tidyselect_1.0.0 data.table_1.12.8 glue_1.3.2
[33] R6_2.4.1 fansi_0.4.1 readxl_1.3.1 modelr_0.1.5
[37] magrittr_1.5 ellipsis_0.3.0 backports_1.1.5 scales_1.1.0
[41] assertthat_0.2.1 rvest_0.3.4 colorspace_1.4-1 stringi_1.4.6
[45] RcppParallel_5.0.0 munsell_0.5.0 broom_0.5.2 crayon_1.3.4
Error in phi[, which(colnames(phi) %in% colnames(phi_))] :
incorrect number of dimensions
Should run.
Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds.
# Please copy and paste the code. If possible, please upload the data file as `.rds`.
Using keyATM 0.4.0
I am unable to change the credible interval inside the by_strata_DocTopic() function.
It works fine with the predict command; however, changing the "ci" argument does not seem to influence the results.
Am I misunderstanding something?
The preparation document says to run these lines:
save(out, file = "SAVENAME.rds")
out <- readRDS(file = "SAVENAME.rds")
However, that doesn't work because R uses separate function pairs for writing and reading RDS files. You can pair save() with load(), or saveRDS() with readRDS(), but you can't mix them. This is my first time coding in R, and it took me a while to figure out that problem.
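For reference, a minimal sketch of the two matched pairs in base R:

```r
out <- list(theta = matrix(0, 2, 2))  # stand-in for a fitted model
f <- tempfile(fileext = ".rds")

saveRDS(out, file = f)   # saveRDS() pairs with readRDS()
out2 <- readRDS(file = f)
identical(out, out2)     # the round trip returns the same object

save(out, file = f)      # save() pairs with load(),
load(f)                  # which restores objects under their original names
```

save()/load() writes a workspace image keyed by object names, while saveRDS()/readRDS() serializes a single object you assign yourself; mixing the two formats produces the "unknown input format" error above.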
sessionInfo() # please run this in R and copy&paste the output
R version 4.3.3 (2024-02-29)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Sonoma 14.4.1
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
time zone: America/Chicago
tzcode source: internal
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_4.3.3
Error in readRDS(file = file_name) : unknown input format
Execution halted
I expect the .rds file to be loaded back into the program from file.
When fitting the keyATM base model, I get "Error: Something goes wrong in sample_lambda_slice()" after a couple hundred iterations. What could be the cause of this?
Let me know what other information you need.
I have fitted the same model with no_keyword_topics = 0 before.
Settings for keyATM base model:
mod <- keyATM::keyATM(
docs = keyATM_counts,
no_keyword_topics = 2,
keywords = marker_list,
model = "base",
options = list(seed = 0,
iterations = 1500,
verbose = TRUE,
llk_per = 100,
use_weights = TRUE,
weights_type = "inv-freq",
prune = TRUE,
thinning = 10,
store_theta = FALSE,
store_pi = FALSE,
parallel_init = FALSE)
)
R version 4.3.2 (2023-10-31)
Platform: x86_64-conda-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS
Matrix products: default
BLAS/LAPACK: /home/pschaefer/miniconda3/envs/r_env/lib/libopenblasp-r0.3.25.so; LAPACK version 3.11.0
locale:
[1] C
time zone: Europe/Berlin
tzcode source: system (glibc)
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] keyATM_0.5.0 blogdown_1.18 zellkonverter_1.12.1
[4] logging_0.10-108 here_1.0.1 cowplot_1.1.2
[7] lubridate_1.9.3 forcats_1.0.0 stringr_1.5.1
[10] dplyr_1.1.4 purrr_1.0.2 readr_2.1.4
[13] tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.4
[16] tidyverse_2.0.0 quanteda_3.3.1 Matrix_1.6-4
[19] MatrixGenerics_1.14.0 matrixStats_1.2.0
loaded via a namespace (and not attached):
[1] SummarizedExperiment_1.32.0 fastmatch_1.1-4
[3] gtable_0.3.4 dir.expiry_1.10.0
[5] xfun_0.41 Biobase_2.62.0
[7] lattice_0.22-5 tzdb_0.4.0
[9] vctrs_0.6.5 tools_4.3.2
[11] bitops_1.0-7 generics_0.1.3
[13] parallel_4.3.2 stats4_4.3.2
[15] fansi_1.0.6 pkgconfig_2.0.3
[17] S4Vectors_0.40.2 RcppParallel_5.1.6
[19] lifecycle_1.0.4 GenomeInfoDbData_1.2.11
[21] compiler_4.3.2 munsell_0.5.0
[23] GenomeInfoDb_1.38.1 RCurl_1.98-1.13
[25] pillar_1.9.0 crayon_1.5.2
[27] SingleCellExperiment_1.24.0 DelayedArray_0.28.0
[29] abind_1.4-5 basilisk_1.14.1
[31] stopwords_2.3 tidyselect_1.2.0
[33] stringi_1.8.3 rprojroot_2.0.4
[35] grid_4.3.2 colorspace_2.1-0
[37] cli_3.6.2 SparseArray_1.2.2
[39] magrittr_2.0.3 S4Arrays_1.2.0
[41] utf8_1.2.4 withr_2.5.2
[43] filelock_1.0.3 scales_1.3.0
[45] timechange_0.2.0 XVector_0.42.0
[47] reticulate_1.34.0 png_0.1-8
[49] hms_1.1.3 GenomicRanges_1.54.1
[51] IRanges_2.36.0 basilisk.utils_1.14.1
[53] rlang_1.1.2 Rcpp_1.0.11
[55] glue_1.6.2 BiocGenerics_0.48.1
[57] jsonlite_1.8.8 R6_2.5.1
[59] zlibbioc_1.48.0
v Initializing the model [32.2s]
[1] log likelihood: -108847548681.69 (perplexity: 14052.86)
[100] log likelihood: -95721577510.63 (perplexity: 4442.06)ETA: 11m
[200] log likelihood: -93863113203.29 (perplexity: 3773.68)ETA: 10m
[300] log likelihood: -93278637181.49 (perplexity: 3585.03)ETA: 9m
[400] log likelihood: -92909389512.57 (perplexity: 3470.74)ETA: 8m
[500] log likelihood: -92615573753.18 (perplexity: 3382.41)ETA: 7m
[600] log likelihood: -92455275564.52 (perplexity: 3335.17)ETA: 7m
[700] log likelihood: -92324715315.99 (perplexity: 3297.18)ETA: 6m
[800] log likelihood: -92207650271.44 (perplexity: 3263.48)ETA: 5m
[900] log likelihood: -92148011044.97 (perplexity: 3246.45)ETA: 5m
[1000] log likelihood: -92085611011.70 (perplexity: 3228.73)ETA: 4m
[1100] log likelihood: -92014578934.45 (perplexity: 3208.67)ETA: 3m
[1200] log likelihood: -91938402270.84 (perplexity: 3187.29)ETA: 2m
Error: Something goes wrong in sample_lambda_slice().0% | ETA: 2m
No error.
It would take some time to make the data available; I am not sure how else to reproduce this error.
Hi Shusei-E
Thanks for your reply, sorry I didn't see it ...
Well in your example just change the keywords list like this :
keywords <- list(Government = c("pol", "pal", "pil"),
Constitution = c("constitution", "rights"),
ForeignAffairs = c("foreign", "war", "missingword", "missing_word"))
visualize_keywords(docs = keyATM_docs, keywords = keywords)
and you will obtain:
Warning in check_keywords(unique(unlisted), keywords, prune) :
Keywords will be pruned because they do not appear in documents: pol, pal, pil, missingword, missing_word
Error in check_keywords(unique(unlisted), keywords, prune) :
All keywords are pruned. Please check: Government
I suppose the same thing happens whenever a topic has no matching words at all.
I know the documentation says this can happen, but I don't understand why the function is not protected against this possibility: what if no words in a topic match?
Hope this helps, regards
Rod
Originally posted by @rodtaq in #177 (comment)
I ran keyATM on a collection of survey responses I have. Each response is short and the total number of tokens is around 30,000. I set the number of topics to 5. Would the small N be the reason why out$theta returns 0 after I run the code below?
out <- keyATM(docs = keyATM_ALL, # text input of all 30000 tokens
no_keyword_topics = 1, # number of topics without keywords
keywords = keywords, # keywords
model = "base", # select the model
options = list(seed = 250,
store_theta = TRUE))
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Catalina 10.15.5
out$theta is empty.
I expected out$theta to contain nonzero probabilities instead of being all 0.
Dear authors,
keyATM is pretty cool! I like it and intend to use it in my research and introduce it to my students. Nevertheless, several points in your User's Guide confuse me, and any help would be appreciated:
1. You set num_states to be 5. I guess this argument refers to the states in the HMM. Am I right? Apologies for not being an expert on it, but why 5? Is it an arbitrary number, based on some prior knowledge, or chosen for some other reason?
2. What does keep = c("Z", "S") mean? And how do I read the results?
Thank you!
sample_z, sample_s: try pass-by-value with const as well (it could be faster than pass-by-reference). In a different branch, after writing tests.
- sample_z
- sample_s
- doc_id
This is an example to show how to report a bug.
I can't run the dynamic topic modeling specified as follows:
dynamic_out_day <- keyATM(docs = keyATM_docs, # text input
no_keyword_topics = 2, # number of topics without keywords
keywords = keywords, # keywords
model = "dynamic", # select the model
model_settings = list(time_index = docvars(my_corpus)$index,
num_states = 5),
options = list(seed = 250, store_theta = TRUE, thinning = 5))
I assume it has something to do with the C side of the package, but don't know exactly what's going on. My current hunch is the error might be related to the size of the data the function can handle.
sessionInfo() # please run this in R and copy&paste the output
R version 4.0.2 (2020-06-22)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/liblapack.so.3
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] quanteda_2.1.0 forcats_0.5.0 stringr_1.4.0
[4] dplyr_1.0.0 purrr_0.3.4 readr_1.3.1
[7] tidyr_1.1.0 tibble_3.0.2 ggplot2_3.3.2
[10] tidyverse_1.3.0 here_0.1 keyATM_0.3.0
# Please copy and paste the error message
Initializing the model...
Fitting the model. 1500 iterations...
free(): invalid next size (normal)
Aborted (core dumped)
Please explain what you expected to happen.
When I ran the dynamic model with the month index, it worked (6 months). When I extended the time index to days (159 days), the function stopped working.
Please paste minimal code that reproduces the bug. If possible, please upload the data file as .rds
.
Unfortunately, I cannot share the data publicly.
Hello
Is it possible to use a wildcard character in the keywords list?
For example, bank* (similar to Seeded LDA).
Thank you
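To my knowledge keyATM takes literal keywords only, so one workaround is to expand glob patterns against the dfm vocabulary yourself before building the keyword list. A minimal sketch with a toy vocabulary; with a real dfm you could take `vocab <- quanteda::featnames(data_dfm)` instead:

```r
# Expand glob patterns (e.g. "bank*") to the matching vocabulary entries
expand_glob <- function(patterns, vocab) {
  unique(unlist(lapply(patterns, function(p)
    grep(utils::glob2rx(p), vocab, value = TRUE))))
}

vocab <- c("bank", "banking", "banks", "credit", "creditor", "market")
keywords <- list(Finance = expand_glob(c("bank*", "credit*"), vocab))
# keywords$Finance: "bank" "banking" "banks" "credit" "creditor"
```

utils::glob2rx() converts the wildcard to a regular expression, so the expanded list contains only words that actually occur in the documents and no keywords get pruned at fitting time.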
- keyATMvb() (as a separate function, reusing as much of keyATM() as possible)
- make_sz_key with C++
- Initialization.
Please add any other information about the feature request.
This is not a feature request, and it is also not a bug.
In the initialization phase (line 394 of the model.R script) you use the parallel::mclapply function. This effectively prevents users from running the package in parallel on Windows machines. If you changed this to future.apply::future_lapply, Windows users would also be able to use the parallel_init option, and it seems like you already use that function in the package.
It seems like only a few lines of code would need to be changed and I could not discern if there was a particular reason for using parallel::mclapply
. I would be happy to do this if required, but since it seems like such a minor thing, you may just want to do it (if you agree that it is a good change).
Allow the latent state of each time step to be exported for customised plots.
# Example
value_figure(fig_timetrend)
# Maybe one more column called latent state
## # A tibble: 290 × 5
## time_index Topic Lower Point Upper
## <int> <chr> <dbl> <dbl> <dbl>
## 1 1789 1_Government 0.100 0.113 0.125
## 2 1789 2_Congress 0.154 0.182 0.211
## 3 1789 3_Peace 0.0441 0.0618 0.0800
## 4 1789 4_Constitution 0.215 0.232 0.254
## 5 1789 5_ForeignAffairs 0.121 0.148 0.173
## 6 1793 1_Government 0.209 0.247 0.287
## 7 1793 2_Congress 0.0207 0.0699 0.0991
## 8 1793 3_Peace 0.0574 0.0876 0.112
## 9 1793 4_Constitution 0.257 0.312 0.388
## 10 1793 5_ForeignAffairs 0.0522 0.0819 0.112
Useful for state transition interpretation.
Hi.
First of all, thank you for making great package.
It has been really helpful for my academic research.
What I'd like to ask about is extracting document numbers from the output model and checking the final perplexity of the output model.
I have news article data separated by news company, and I'd like to sort the docs by company together with their topic numbers, so that I can see the distribution of topics for each news company.
What I have looked into so far is the function named "top_docs" with the option "n".
I changed "n" to print all the doc numbers belonging exclusively to each topic, but the returned dataframe has duplicated doc numbers across topics.
Is there any way to extract the documents exclusive to each topic number?
Also, please provide a way to check the perplexity of the output model.
Thank you.
Hi all,
I have a question with regard to the preparation of the dfm
.
In the package description you highlight that one should "aim for 7,000 to 10,000 unique words at the maximum“.
However, I guess that this highly depends on the size of the whole corpus. In my case I am looking at a corpus of nearly 1 million documents with more than 800,000 unique words. Trimming this corpus down to 7,000-10,000 unique words would considerably reduce the complexity of the content of these documents.
Therefore, I wanted to ask why one should aim for max. 10,000 words and how one should deal with the case of such large corpora.
Thank you!
It would be very useful to connect the outputted theta values to the original verbatims for modal classification purposes. Although you can currently grab theta from the keyATM_output object and merge it back into the original df, it is less than obvious and explicit.
This would be used to examine the probability of topic assignment for each document.
Thank you for a great package!
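A minimal sketch of such a merge, assuming out$theta is the documents-by-topics matrix with one row per input document in the original order; a hand-built toy matrix stands in for a real fit here:

```r
# Toy theta: 2 documents x 2 topics (rows sum to 1)
theta <- matrix(c(0.7, 0.2,
                  0.3, 0.8), nrow = 2,
                dimnames = list(c("doc1", "doc2"), c("1_Gov", "2_Econ")))
df <- data.frame(doc_id = rownames(theta), text = c("a", "b"))

df$modal_topic <- colnames(theta)[max.col(theta)]  # most probable topic per doc
df_theta <- cbind(df, as.data.frame(theta))        # merge proportions back in
```

max.col() picks the index of the highest-probability topic in each row, which is the modal classification the request describes.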
This line does not work with the newest version of dplyr
(v1.0.0
).
https://github.com/keyATM/keyATM/blob/master/R/model.R#L181
The summarise function will automatically ungroup by default (see here). Although this functionality is labeled as experimental, it is included in the CRAN version, and Travis returns an error.
How to fix (my understanding): pass the new .groups argument, e.g.
dplyr::summarize(WordCount = dplyr::n(), .groups = "drop_last")
In either case, we need to fix the DESCRIPTION (dplyr >= 1.0.0):
https://github.com/keyATM/keyATM/blob/master/DESCRIPTION#L12
> sessionInfo()
R version 3.6.3 (2020-02-29)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS Sierra 10.12.6
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda_2.0.1 keyATM_0.3.0 testthat_2.3.2
loaded via a namespace (and not attached):
[1] Rcpp_1.0.4.6 magrittr_1.5 usethis_1.6.1 stopwords_2.0
[5] tidyselect_1.1.0 munsell_0.5.0 colorspace_1.4-1 lattice_0.20-38
[9] R6_2.4.1 rlang_0.4.6 fastmatch_1.1-0 fansi_0.4.1
[13] stringr_1.4.0 dplyr_1.0.0 tools_3.6.3 grid_3.6.3
[17] data.table_1.12.8 gtable_0.3.0 utf8_1.1.4 cli_2.0.2
[21] ellipsis_0.3.1 assertthat_0.2.1 RcppParallel_5.0.1 tibble_3.0.1
[25] lifecycle_0.2.0 crayon_1.3.4 Matrix_1.2-18 purrr_0.3.4
[29] ggplot2_3.3.1 fs_1.4.1 vctrs_0.3.0 glue_1.4.1
[33] stringi_1.4.6 compiler_3.6.3 pillar_1.4.4 generics_0.0.2
[37] scales_1.1.1 pkgconfig_2.0.3
> p <- visualize_keywords(keyATM_docs, bills_keywords)
`summarise()` ungrouping output (override with `.groups` argument)
Error: `...` is not empty.
We detected these problematic arguments:
* `..1`
These dots only exist to allow future extensions and should be empty.
Did you misspecify an argument?
visualize_keywords should run.
The code is in the testthat tests:
data(keyATM_data_bills)
bills_dfm <- keyATM_data_bills$doc_dfm
bills_keywords <- keyATM_data_bills$keywords
keyATM_docs <- keyATM_read(bills_dfm)
bills_cov <- keyATM_data_bills$cov
bills_time_index <- keyATM_data_bills$time_index
labels_use <- keyATM_data_bills$labels
p <- visualize_keywords(keyATM_docs, bills_keywords)
Hi folks, I'm wondering if you could add an option to export the iteration number to a .txt file while the model is converging.
I'm currently working on a project that must be completed on a virtual machine through Jupyter Notebooks. Basically, I'm analyzing newspapers on ProQuest, and they require all analysis to be done through their VM. However, there is a lot of data and the interface is very poor and doesn't show the progress bar while the code is running. I could get around this issue if the keyATM
function could export the iteration number to a .txt file. I completely understand if this use case is too specific to justify editing the main function in the package.
With much thanks to https://github.com/phargarten2/matrixNormal/issues/1, the Kronecker Product in matrixNormal::rmatnorm
has been changed from koch(U,V)
to koch(V,U)
. The original version used a citation in a paper that was found to be incorrect. I have updated the package (submitting it to CRAN for approval). I am sorry for any inconvenience.
Hello, thanks for your library. I ran some tests and I get the same error (Windows, R 3.6) 👍
Warning in check_keywords(info$wd_names, keywords, options$prune) :
Keywords will be pruned because they do not appear in documents: "interest_rates",
"net_margins", "cash_margins", [... truncated]
Error in mapped$set(keys[x], values[x]) : key must be not be "" or NA
Is there any restriction on how to write the keywords?
Thanks in advance
Keyword:
Dictionary object with 2 key entries.
Text : any text
Results:
values_fig(key_viz)
A tibble: 13 x 5, Groups: Topic [2]
Word WordCount Proportion(%) Ranking Topic
1 contracts 12 0.002 1 1_Corpo
2 patents 3 0.001 2 1_Corpo
3 regulations 3 0.001 3 1_Corpo
4 legal 2 0 4 1_Corpo
5 settlements 1 0 5 1_Corpo
6 m_a NA NA 6 1_Corpo
7 operational_risk NA NA 7 1_Corpo
8 sanction NA NA 8 1_Corpo
9 fiscal 1890 0.339 1 2_FxTax
10 tax_rate 6 0.001 2 2_FxTax
11 foreign_exchange 1 0 3 2_FxTax
12 taxe NA NA 4 2_FxTax
13 government_tariffs NA NA 5 2_FxTax
I am repeatedly running into a complete "R session aborted" in RStudio while attempting to run the model. I have a fairly large dfm with ~80k documents and 50k features. I expect this sample to grow in both N and P.
Have you experienced any issues with large matrices?
Hi I am new to keyATM, and I am learning keyATM Dynamic. My data are Chinese newspaper articles, and there are 3-5 news reports per day (see the screenshot below please). I followed instructions on your website to prepare time index, but got the following error message:
Error in check_arg_model_settings(obj, model, info) : model_settings$time_index does not increment by 1.
How do I fix this issue? Any help is appreciated!
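keyATM's dynamic model expects time_index to be consecutive integers starting at 1, so raw dates with gaps (missing days) trigger this check. One common workaround (an assumption on my part, not an official keyATM recipe) is to map each distinct date to its rank, so the index increments by exactly 1:

```r
# Toy dates with gaps (Jan 2 and Jan 4-6 are missing)
dates <- as.Date(c("2020-01-01", "2020-01-01", "2020-01-03", "2020-01-07"))

time_index <- as.integer(factor(dates))  # ranks of distinct dates: 1 1 2 3
```

Note that this treats non-consecutive calendar days as adjacent time steps, which may or may not be appropriate for the model's smoothness assumptions; an alternative is to aggregate the index to a coarser unit (e.g. weeks) that has no gaps.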
Running the example model provided in the user guide, I am able to replicate the model output on Mac and Linux, but not on Windows, despite using the same seed. The model output is consistent between different Windows machines (at least on the three I have tried), but consistently produces different results from the other platforms (see top_words()
and top_docs()
results pasted below).
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 19041)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=English_Germany.1252 LC_CTYPE=English_Germany.1252
## [3] LC_MONETARY=English_Germany.1252 LC_NUMERIC=C
## [5] LC_TIME=English_Germany.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] magrittr_1.5 quanteda_2.1.2 keyATM_0.3.1
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.5 pillar_1.4.6 compiler_4.0.3
## [4] tools_4.0.3 stopwords_2.0 digest_0.6.27
## [7] evaluate_0.14 lifecycle_0.2.0 tibble_3.0.4
## [10] gtable_0.3.0 lattice_0.20-41 pkgconfig_2.0.3
## [13] rlang_0.4.8 Matrix_1.2-18 fastmatch_1.1-0
## [16] parallel_4.0.3 yaml_2.2.1 xfun_0.19
## [19] fastmap_1.0.1 stringr_1.4.0 dplyr_1.0.2
## [22] knitr_1.30 fs_1.5.0 generics_0.1.0
## [25] vctrs_0.3.4 grid_4.0.3 tidyselect_1.1.0
## [28] glue_1.4.2 data.table_1.13.2 R6_2.5.0
## [31] rmarkdown_2.5 tidyr_1.1.2 ggplot2_3.3.2
## [34] purrr_0.3.4 ISOcodes_2020.03.16 usethis_1.6.3
## [37] scales_1.1.1 ellipsis_0.3.1 htmltools_0.5.0
## [40] colorspace_2.0-0 stringi_1.5.3 RcppParallel_5.0.2
## [43] munsell_0.5.0 crayon_1.3.4
For top_words(out):
## 1_Government 2_Congress 3_Peace
## 1 national government world [✓]
## 2 laws [✓] people peace [✓]
## 3 law [✓] states new
## 4 office union people
## 5 secure congress [✓] freedom [✓]
## 6 order interests america
## 7 republic policy let
## 8 business made government
## 9 respect administration nation
## 10 american present life
## 4_Constitution 5_ForeignAffairs Other_1 Other_2 Other_3
## 1 constitution [✓] country power great public
## 2 rights [✓] every state nations political
## 3 now citizens powers nation executive [1]
## 4 duty united support good system
## 5 free war [✓] general men confidence
## 6 institutions spirit well justice necessary
## 7 commerce fellow right many far
## 8 trust foreign [✓] principle first duties
## 9 honor time part purpose federal
## 10 citizen years high action prosperity
## Other_4 Other_5
## 1 hope one
## 2 american make
## 3 know much
## 4 day president
## 5 strength just
## 6 need always
## 7 land better
## 8 things others
## 9 power home
## 10 earth place
For top_docs(out):
## 1_Government 2_Congress 3_Peace 4_Constitution 5_ForeignAffairs Other_1
## 1 31 15 47 10 7 14
## 2 36 18 52 6 2 11
## 3 34 19 53 1 13 16
## 4 26 24 50 14 44 15
## 5 21 12 46 9 5 12
## 6 27 23 45 12 1 8
## 7 35 2 56 3 9 9
## 8 24 8 40 13 20 3
## 9 37 25 44 29 33 10
## 10 28 28 51 28 3 5
## Other_2 Other_3 Other_4 Other_5
## 1 38 11 58 47
## 2 41 26 46 49
## 3 32 31 43 48
## 4 36 16 54 52
## 5 43 28 51 57
## 6 3 17 42 53
## 7 6 23 39 46
## 8 35 1 55 50
## 9 30 7 37 51
## 10 37 6 48 32
See the base model output in the user guide, which I am able to replicate on other platforms. For example, the top three documents of topic 1_Government should be 9, 14, and 8.
library(keyATM)
library(quanteda)
library(magrittr)
data(data_corpus_inaugural, package = "quanteda")
data_tokens <- tokens(data_corpus_inaugural,
remove_numbers = TRUE,
remove_punct = TRUE,
remove_symbols = TRUE,
remove_separators = TRUE,
remove_url = TRUE) %>%
tokens_tolower() %>%
tokens_remove(c(stopwords("english"),
"may", "shall", "can",
"must", "upon", "with", "without")) %>%
tokens_select(min_nchar = 3)
data_dfm <- dfm(data_tokens) %>%
dfm_trim(min_termfreq = 5, min_docfreq = 2)
keyATM_docs <- keyATM_read(texts = data_dfm)
summary(keyATM_docs)
keywords <- list(Government = c("laws", "law", "executive"),
Congress = c("congress", "party"),
Peace = c("peace", "world", "freedom"),
Constitution = c("constitution", "rights"),
ForeignAffairs = c("foreign", "war"))
out <- keyATM(docs = keyATM_docs,
no_keyword_topics = 5,
keywords = keywords,
model = "base",
options = list(seed = 250))
top_words(out)
top_docs(out)
I have a use-case of using the dynamic keyATM model. My transition matrix
However, I would need the transition matrix to allow backward state switching. That is
I am thinking of changing the sampling procedure for
Do you think this is possible? I hope you can offer some advice on which lines in the source code I should be aware of.
Thank you.
fastmap seems to be slower than hashmap. Write make_sz_key in C++ if needed. Keep the old initialization as old.
As in the examples, I expected the top_words function to give me a tibble with as many columns as named elements in the list of keywords. I have four such elements, but I get 8 columns: the first four are as expected, and the remaining are labelled "Other_1", "Other_2", etc.
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
locale:
[1] LC_CTYPE=en_AU.UTF-8 LC_NUMERIC=C LC_TIME=en_AU.UTF-8
[4] LC_COLLATE=en_AU.UTF-8 LC_MONETARY=en_AU.UTF-8 LC_MESSAGES=en_AU.UTF-8
[7] LC_PAPER=en_AU.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] quanteda.textstats_0.95 keyATM_0.4.0 quanteda_3.2.1
[4] ggpubr_0.4.0 deeptime_0.2.1 zoo_1.8-9
[7] scales_1.1.1 ggtext_0.1.1 readabs_0.4.11
[10] pbmcapply_1.5.0 tictoc_1.0.1 tidytable_0.6.7
[13] lubridate_1.8.0 forcats_0.5.1 stringr_1.4.0
[16] dplyr_1.0.7 purrr_0.3.4 readr_2.1.1
[19] tidyr_1.1.4 tibble_3.1.6 ggplot2_3.3.5
[22] tidyverse_1.3.1 reticulate_1.23
loaded via a namespace (and not attached):
[1] fs_1.5.2 httr_1.4.2 tools_4.0.3 backports_1.4.1
[5] utf8_1.2.2 R6_2.5.1 DBI_1.1.2 colorspace_2.0-2
[9] withr_2.4.3 tidyselect_1.1.1 gridExtra_2.3 compiler_4.0.3
[13] cli_3.1.0 rvest_1.0.2 pacman_0.5.1 xml2_1.3.3
[17] labeling_0.4.2 digest_0.6.29 rmarkdown_2.11 pkgconfig_2.0.3
[21] htmltools_0.5.2 parallelly_1.30.0 dbplyr_2.1.1 fastmap_1.1.0
[25] rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13 farver_2.1.0
[29] generics_0.1.1 jsonlite_1.7.2 car_3.0-12 magrittr_2.0.1
[33] Matrix_1.4-0 Rcpp_1.0.8 munsell_0.5.0 fansi_1.0.2
[37] ggfittext_0.9.1 abind_1.4-5 ggnewscale_0.4.5 lifecycle_1.0.1
[41] yaml_2.2.1 stringi_1.7.6 carData_3.0-5 MASS_7.3-55
[45] grid_4.0.3 listenv_0.8.0 crayon_1.4.2 lattice_0.20-45
[49] haven_2.4.3 gridtext_0.1.4 hms_1.1.1 knitr_1.37
[53] pillar_1.6.4 ggsignif_0.6.3 codetools_0.2-18 future.apply_1.8.1
[57] stopwords_2.3 fastmatch_1.1-3 reprex_2.0.1 glue_1.6.0
[61] evaluate_0.14 data.table_1.14.2 RcppParallel_5.1.5 modelr_0.1.8
[65] png_0.1-7 vctrs_0.3.8 tzdb_0.2.0 tweenr_1.0.2
[69] cellranger_1.1.0 gtable_0.3.0 polyclip_1.10-0 future_1.23.0
[73] assertthat_0.2.1 xfun_0.29 ggforce_0.3.3 broom_0.7.11
[77] rstatix_0.7.0 nsyllable_1.0.1 globals_0.14.0 ellipsis_0.3.2
> names(keywords)
[1] "left" "right" "gal" "tan"
---(following the model instructions for keyATMcovariates here)----
> names(top_words(res))
[1] "1_left" "2_right" "3_gal" "4_tan" "Other_1" "Other_2" "Other_3" "Other_4"
I'd expect only the first four items to appear, or else the meaning of these "Other" values to be explained somewhere, as no error is thrown.
Can't share the data at the moment.
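For what it's worth, the extra "Other" columns appear to track the no_keyword_topics argument (the base-model examples earlier in this thread pass no_keyword_topics = 0 and no_keyword_topics = 5 explicitly). A sketch, assuming that parameter behaves the same way for the covariate model; the covariate data and formula here are placeholders:

```r
library(keyATM)

# Sketch: no_keyword_topics controls how many topics WITHOUT keywords are
# estimated alongside the keyword topics. With 0, only the four keyword
# topics should appear in top_words(). `keyATM_docs`, `keywords`, `cov_df`,
# and the formula are assumed/placeholder objects.
res <- keyATM(docs = keyATM_docs,
              no_keyword_topics = 0,            # suppress "Other_*" topics
              keywords = keywords,              # four named elements
              model = "covariates",
              model_settings = list(covariates_data = cov_df,      # placeholder
                                    covariates_formula = ~ group)) # placeholder
```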
Hi authors!
Thank you very much for this wonderful package; it is a great tool and is helping me a lot with my research! I have a question/suggestion below. I am not very experienced in topic modelling, but this comes from my experience using the package so far.
Currently, I am exploring the package in a research project on Twitter data. However, I am slightly confused about the use of keywords and whether the keywords option supports dictionaries that contain regular expressions. For instance, when I select "Terror*" (aiming for terrorism/terrorist/terror), visualize_keywords() reports that the word does not appear in the corpus when it actually does. Thus, I am led to believe that keyATM does not support keywords with regexes.
Yet, this presents two problems, I think:
Maybe the option of allowing regular expressions for keywords could be added? Or, if that is not possible, it may be a good idea to document this limitation, along with any recommended workarounds, as I looked for that in the documentation but couldn't find it.
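A possible workaround along these lines (a sketch of my own, not from the keyATM documentation) is to expand the pattern against the dfm vocabulary first and pass the matching literal terms as ordinary keywords; data_dfm is an assumed, already-constructed quanteda dfm:

```r
library(quanteda)

# Sketch: expand a wildcard-style pattern into the literal vocabulary
# terms it matches, then use those terms as plain keywords.
terror_terms <- grep("^terror", featnames(data_dfm), value = TRUE)
keywords <- list(Terror = terror_terms)
```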
Hope this helps! Thanks again for the package!
Thank you so much for making this wonderful package! I have successfully replicated the sample estimations shown on the package homepage. However, when I tried to run keyATM_read() on Chinese text data, a garbled-characters issue arose in my environment. I have checked the text encoding with as_utf8() and utf8_valid(), but it seems something goes wrong during keyATM_read().
I regularly use quanteda with the same Chinese data and have had no such trouble. I also found that another user does not face this issue when using the same text data and running the same code, so I suspect my environment has some problem. I guess this is primarily an encoding issue that I should resolve myself, but let me post it here. For your information, my default text encoding in R is UTF-8 (checked under Tools -> Code -> Saving).
Please let me know if you need more details on my setup, script, and/or data.
Thank you!
sessionInfo() # please run this in R and copy&paste the output
R version 4.1.2 (2021-11-01)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 22000)
Matrix products: default
locale:
[1] LC_COLLATE=Japanese_Japan.932 LC_CTYPE=Japanese_Japan.932
[3] LC_MONETARY=Japanese_Japan.932 LC_NUMERIC=C
[5] LC_TIME=Japanese_Japan.932
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] showtext_0.9-5 showtextdb_3.0 sysfonts_0.8.8 utf8_1.2.2
[5] lubridate_1.8.0 jiebaR_0.11 jiebaRD_0.1 readxl_1.4.1
[9] forcats_0.5.2 stringr_1.4.1 dplyr_1.0.10 purrr_0.3.4
[13] readr_2.1.3 tidyr_1.2.1 tibble_3.1.8 ggplot2_3.4.0
[17] tidyverse_1.3.2 keyATM_0.4.1 quanteda_3.2.3
loaded via a namespace (and not attached):
[1] Rcpp_1.0.9 lattice_0.20-45 assertthat_0.2.1
[4] R6_2.5.1 cellranger_1.1.0 backports_1.4.1
[7] reprex_2.0.2 httr_1.4.4 pillar_1.8.1
[10] rlang_1.0.6 googlesheets4_1.0.1 rstudioapi_0.14
[13] quanteda.corpora_0.9.2 Matrix_1.4-1 textshaping_0.3.6
[16] googledrive_2.0.0 munsell_0.5.0 broom_1.0.1
[19] compiler_4.1.2 modelr_0.1.9 systemfonts_1.0.4
[22] pkgconfig_2.0.3 tidyselect_1.2.0 fansi_1.0.3
[25] crayon_1.5.2 tzdb_0.3.0 dbplyr_2.2.1
[28] withr_2.5.0 grid_4.1.2 jsonlite_1.8.2
[31] gtable_0.3.1 lifecycle_1.0.3 DBI_1.1.3
[34] magrittr_2.0.3 scales_1.2.1 RcppParallel_5.1.5
[37] cli_3.4.1 stringi_1.7.6 fs_1.5.2
[40] xml2_1.3.3 ragg_1.2.2 ellipsis_0.3.2
[43] stopwords_2.3 generics_0.1.3 vctrs_0.5.1
[46] fastmatch_1.1-3 RColorBrewer_1.1-3 tools_4.1.2
[49] glue_1.6.2 hms_1.1.2 colorspace_2.0-3
[52] gargle_1.2.1 rvest_1.0.3 haven_2.5.1
By checking the wd_names object in keyATM_docs after running keyATM_read(), I found garbled characters. The dfm is constructed by quanteda, and there are no garbled characters before keyATM_read(): I checked topfeatures(dfm, 100) and found no issue just prior to the call.
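One way to narrow this down might be the following sketch (my own suggestion, not from the package documentation; it assumes the quanteda dfm is named dfm, and CP932 corresponds to the Japanese_Japan.932 locale shown in the sessionInfo above; the iconv direction is an untested assumption):

```r
library(quanteda)

# Sketch: check whether the dfm's feature names are valid UTF-8 before
# keyATM_read(), to locate where the re-encoding happens.
feats <- featnames(dfm)
feats[!validUTF8(feats)]   # any invalid entries?

# If the strings are being interpreted in the native CP932 locale,
# forcing a declared encoding may help (assumption, untested):
feats_fixed <- iconv(feats, from = "CP932", to = "UTF-8")
```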
# Please copy and paste the error message
> keyATM_docs <- keyATM_read(dfm,
+ encoding = "UTF-8",
+ check = TRUE,
+ keep_docnames = FALSE,
+ progress_bar = FALSE,
+ split = 0)
Using quanteda dfm.
> summary(keyATM_docs)
keyATM_docs object of: 983 documents.
Length of documents:
Avg: 542.942
Min: 48
Max: 4706
SD: 551.392
Number of unique words: 12616
keyATM_docs[["wd_names"]]
[1] "蝮壼ョ壻ク咲ァサ" "襍ー"
[3] "荳ュ蝗ス迚ケ濶イ遉セ莨壻クサ荵<89>" "豕墓イサ"
[5] "驕楢キッ" "蜈ィ髱「"
[7] "謗ィ霑<9b>" "萓晄ウ墓イサ蝗ス"
Because wd_names is garbled, visualize_keywords() does not work.
> key_viz <- visualize_keywords(docs = keyATM_docs, keywords = keywords)
Warning in check_keywords(unique(unlisted), keywords, prune) :
Keywords will be pruned because they do not appear in documents: <U+7ECF><U+6D4E>, 商<U+4E1A>, <U+519B>
Error in check_keywords(unique(unlisted), keywords, prune) :
All keywords are pruned. Please check: econ, military
I hope to have a basic result as shown in the Preparation page (https://keyatm.github.io/keyATM/articles/pkgdown_files/Preparation.html).
Thank you for your consideration!