
ruimtehol's People

Contributors

everdark, jwijffels, kalibera

ruimtehol's Issues

R CMD check FAIL on r-devel windows

This happens on Windows with gcc 12.2 (Rtools43), i.e. in R-devel (to become R 4.3).

using R Under development (unstable) (2023-01-04 r83561 ucrt)
using platform: x86_64-w64-mingw32 (64-bit)
R was compiled by
    gcc.exe (GCC) 12.2.0
    GNU Fortran (GCC) 12.2.0
running under: Windows Server x64 (build 20348)
using session charset: UTF-8
checking for file 'ruimtehol/DESCRIPTION' ... OK
checking extension type ... Package
this is package 'ruimtehol' version '0.3'
package encoding: UTF-8
checking package namespace information ... OK
checking package dependencies ... OK
checking if this is a source package ... OK
checking if there is a namespace ... OK
checking for hidden files and directories ... OK
checking for portable file names ... OK
checking whether package 'ruimtehol' can be installed ... OK
used C++ compiler: 'g++.exe (GCC) 12.2.0'
checking installed package size ... NOTE
  installed size is 5.7Mb
  sub-directories of 1Mb or more:
    data 3.8Mb
    libs 1.4Mb
checking package directory ... OK
checking 'build' directory ... OK
checking DESCRIPTION meta-information ... OK
checking top-level files ... OK
checking for left-over files ... OK
checking index information ... OK
checking package subdirectories ... OK
checking R files for non-ASCII characters ... OK
checking R files for syntax errors ... OK
checking whether the package can be loaded ... [1s] OK
checking whether the package can be loaded with stated dependencies ... [1s] OK
checking whether the package can be unloaded cleanly ... [1s] OK
checking whether the namespace can be loaded with stated dependencies ... [1s] OK
checking whether the namespace can be unloaded cleanly ... [1s] OK
checking loading without being on the library search path ... [1s] OK
checking use of S3 registration ... OK
checking dependencies in R code ... OK
checking S3 generic/method consistency ... OK
checking replacement functions ... OK
checking foreign function calls ... OK
checking R code for possible problems ... [5s] OK
checking Rd files ... [1s] OK
checking Rd metadata ... OK
checking Rd cross-references ... OK
checking for missing documentation entries ... OK
checking for code/documentation mismatches ... OK
checking Rd \usage sections ... OK
checking Rd contents ... OK
checking for unstated dependencies in examples ... OK
checking contents of 'data' directory ... OK
checking data for non-ASCII characters ... [1s] OK
checking LazyData ... OK
checking data for ASCII and uncompressed saves ... OK
checking line endings in C/C++/Fortran sources/headers ... OK
checking line endings in Makefiles ... OK
checking compilation flags in Makevars ... OK
checking for GNU extensions in Makefiles ... OK
checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS) ... OK
checking use of PKG_*FLAGS in Makefiles ... OK
checking include directives in Makefiles ... OK
checking pragmas in C/C++ headers and code ... OK
checking compiled code ... OK
checking sizes of PDF files under 'inst/doc' ... OK
checking installed files from 'inst/doc' ... OK
checking files in 'vignettes' ... OK
checking examples ...
Check process probably crashed or hung up for 20 minutes ... killed
Most likely this happened in the example checks (?),
if not, ignore the following last lines of example output:
>
> ## Don't show:
> if(require(udpipe)){
+ ## End(Don't show)
+ library(udpipe)
+ data(brussels_reviews_anno, package = "udpipe")
+ x <- subset(brussels_reviews_anno, language == "nl")
+ x$token <- x$lemma
+ x <- x[, c("doc_id", "sentence_id", "token")]
+ set.seed(123456789)
+ model <- embed_articlespace(x, early_stopping = 1,
+ dim = 25, epoch = 25, minCount = 2,
+ negSearchLimit = 1, maxNegSamples = 2)
+ plot(model)
+ sentences <- c("ook de keuken zijn zeer goed uitgerust .",
+ "het appartement zijn met veel smaak inrichten en zeer proper .")
+ predict(model, sentences, type = "embedding")
+ starspace_embedding(model, sentences)
+ ## Don't show:
+ } # End of main if statement running only if the required packages are installed
Loading required package: udpipe
Start to initialize starspace model.
Build dict from input file : D:\temp\RtmpCmjMu3\textspace_2a4984ad938f4.txt

Read 0M words
Number of words in dictionary: 1273
Number of labels in dictionary: 0
Loading data from file : D:\temp\RtmpCmjMu3\textspace_2a4984ad938f4.txt
Total number of examples loaded : 470
2023-01-05 12:58:12 Initialising with learning rate 0.01
======== End of example output (where/before crash/hang up occured ?) ========

What function to use when checking similarity between documents

Dear Jan,

First of all: thank you for this brilliant package! It has been very useful to me for text classification tasks.

Now I have another problem at hand and I was wondering whether ruimtehol could help. I have a couple of hundred text documents. Is there a ruimtehol function that could rank these documents by their similarity to a completely new text document? In other words, given a new document, I want to find out which existing documents are most similar to it. My best guess was embed_articlespace(), but I couldn't find an example that seems to do exactly what I want. Is there an example somewhere, or does ruimtehol not fit my research goal, so that I have to look elsewhere? Many thanks in advance!
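A minimal sketch of the ranking step, assuming every document has already been turned into an embedding vector (for instance with predict(model, sentences, type = "embedding"), as in the package example quoted earlier on this page). The matrices below are made-up numbers and the helper cosine_to_each_row is a hypothetical name, not part of ruimtehol.

```r
# Hypothetical embeddings: one row per known document, one column per dimension.
doc_embeddings <- rbind(
  doc1 = c(1, 0, 0),
  doc2 = c(0.8, 0.6, 0),
  doc3 = c(0, 1, 0)
)
# Hypothetical embedding of the new, unseen document.
new_embedding <- c(1, 0.1, 0)

# Cosine similarity between each row of m and the vector v.
cosine_to_each_row <- function(m, v) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

similarity <- cosine_to_each_row(doc_embeddings, new_embedding)
names(similarity) <- rownames(doc_embeddings)

# Most similar documents first.
ranking <- sort(similarity, decreasing = TRUE)
print(ranking)
```

With real data, doc_embeddings would hold one row per existing document and the ranking would be truncated to the top few entries.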

starspace test mode

Hi there. Thanks for developing this package. It's really great work.

I'm trying to figure out how we might use the starspace test functionality that's available with Starspace from within the R package.

To be more specific, I've created a model using model <- embed_tagspace(...) and would now like to run predictions on a hold-out test set. I know it's possible to use the predict function to do this (I have already done so), but the starspace example shell scripts can run starspace test -model ..., which prints some automated test metrics. Is it possible to use this starspace functionality directly from within the R environment?

I can see from https://github.com/bnosac/ruimtehol/blob/master/src/rcpp_textspace.cpp that the test input argument is being created. I just can't quite figure out how to pass this to the embed_ or starspace functions. Any help would be greatly appreciated.

Thank you.
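Until the test mode is reachable from R, a metric such as hit@k can be computed by hand from ranked predictions. The sketch below is only an illustration: the shape of predicted (a list of label vectors ordered from best to worst) and the function name hit_at_k are assumptions, and the output of predict() would first have to be brought into that shape.

```r
# Hypothetical ranked predictions for three hold-out documents.
predicted <- list(
  c("sports", "politics", "tech"),
  c("tech", "sports", "politics"),
  c("politics", "tech", "sports")
)
# Hypothetical true label of each hold-out document.
actual <- c("sports", "sports", "tech")

# Share of documents whose true label is among the top k predictions.
hit_at_k <- function(predicted, actual, k) {
  hits <- mapply(function(p, a) a %in% head(p, k), predicted, actual)
  mean(hits)
}

hit_1 <- hit_at_k(predicted, actual, k = 1)
hit_2 <- hit_at_k(predicted, actual, k = 2)
print(c(hit_at_1 = hit_1, hit_at_2 = hit_2))
```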

logs from ruimtehol on rhub:

check("../ruimtehol_0.2.tar.gz", platform = "macos-mavericks-oldrel")

#> Running `R CMD build`...
#> * checking for file ‘/Users/usereCWk4LeY/Rtemp/RtmpfqZK77/remotes12ee754fd806a/ruimtehol/DESCRIPTION’ ... OK
#> * preparing ‘ruimtehol’:
#> * checking DESCRIPTION meta-information ... OK
#> * cleaning src
#> * checking vignette meta-information ... OK
#> * checking for LF line-endings in source and make files
#> * checking for empty or unneeded directories
#> * looking to see if a ‘data/datalist’ file should be added
#> * building ‘ruimtehol_0.2.tar.gz’
#> Installing package into ‘/Users/usereCWk4LeY/R’
#> (as ‘lib’ is unspecified)
#> * installing *source* package ‘ruimtehol’ ...
#> ** libs
#> clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/usereCWk4LeY/R/Rcpp/include" -I"/Users/usereCWk4LeY/R/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c Starspace/src/utils/args.cpp -o Starspace/src/utils/args.o
#> clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/usereCWk4LeY/R/Rcpp/include" -I"/Users/usereCWk4LeY/R/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c Starspace/src/utils/normalize.cpp -o Starspace/src/utils/normalize.o
#> clang++ -std=c++11 -I/Library/Frameworks/R.framework/Resources/include -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I/usr/local/include -I/usr/local/include/freetype2 -I/opt/X11/include -I"/Users/usereCWk4LeY/R/Rcpp/include" -I"/Users/usereCWk4LeY/R/BH/include" -fPIC -Wall -mtune=core2 -g -O2 -c Starspace/src/utils/utils.cpp -o Starspace/src/utils/utils.o
#> In file included from Starspace/src/utils/utils.cpp:10:
#> Starspace/src/utils/utils.h:77:8: error: thread-local storage is unsupported for the current target
#> extern thread_local int id;
#> ^
#> Starspace/src/utils/utils.cpp:15:1: error: thread-local storage is unsupported for the current target
#> thread_local int id;
#> ^
#> 2 errors generated.
#> make: *** [Starspace/src/utils/utils.o] Error 1
#> ERROR: compilation failed for package ‘ruimtehol’
#> * removing ‘/Users/usereCWk4LeY/R/ruimtehol’
#> Warning message:
#> In i.p(...) :
#> installation of package ‘/Users/usereCWk4LeY/Rtemp/RtmpfqZK77/file12ee748cc5751/ruimtehol_0.2.tar.gz’ had non-zero exit status

check("../ruimtehol_0.2.1.tar.gz", platform = "macos-elcapitan-release")

─  Uploading package
─  Preparing build, see status at
   https://builder.r-hub.io/status/ruimtehol_0.2.1.tar.gz-d3a0cccb8b17442e8031d23972fd86f9
─  Build started
─  Creating user userKKhmoS52
─  Downloading package
─  Setting up home directory
─  Running check
─  Installing package dependencies
─  Running R CMD check
-  using log directory ‘/Users/userKKhmoS52/ruimtehol.Rcheck’ (1m 55.4s)
-  using R version 3.6.0 (2019-04-26)
-  using platform: x86_64-apple-darwin15.6.0 (64-bit)
-  using session charset: UTF-8
√  checking for file ‘ruimtehol/DESCRIPTION’
-  checking extension type ... Package
-  this is package ‘ruimtehol’ version ‘0.2.1’
-  package encoding: UTF-8
√  checking package namespace information
√  checking package dependencies (2.7s)
√  checking if this is a source package
√  checking if there is a namespace
√  checking for executable files
√  checking for hidden files and directories
√  checking for portable file names
√  checking for sufficient/correct file permissions
√  checking whether package ‘ruimtehol’ can be installed (2m 7s)
√  checking installed package size
√  checking package directory
√  checking ‘build’ directory
√  checking DESCRIPTION meta-information
√  checking top-level files
√  checking for left-over files
√  checking index information
√  checking package subdirectories
√  checking R files for non-ASCII characters
√  checking R files for syntax errors
√  checking whether the package can be loaded (2.6s)
√  checking whether the package can be loaded with stated dependencies
√  checking whether the package can be unloaded cleanly
√  checking whether the namespace can be loaded with stated dependencies
√  checking whether the namespace can be unloaded cleanly
√  checking loading without being on the library search path
√  checking dependencies in R code
√  checking S3 generic/method consistency (2.8s)
√  checking replacement functions
√  checking foreign function calls
√  checking R code for possible problems (2.9s)
√  checking Rd files
√  checking Rd metadata
√  checking Rd cross-references
√  checking for missing documentation entries
√  checking for code/documentation mismatches (3s)
√  checking Rd \usage sections
√  checking Rd contents
√  checking for unstated dependencies in examples
√  checking contents of ‘data’ directory
√  checking data for non-ASCII characters
√  checking data for ASCII and uncompressed saves
√  checking line endings in C/C++/Fortran sources/headers
√  checking line endings in Makefiles
√  checking compilation flags in Makevars
√  checking for GNU extensions in Makefiles
√  checking for portable use of $(BLAS_LIBS) and $(LAPACK_LIBS)
√  checking include directives in Makefiles
√  checking compiled code
√  checking sizes of PDF files under ‘inst/doc’
√  checking installed files from ‘inst/doc’
√  checking files in ‘vignettes’
√  checking examples (29.6s)
√  checking for unstated dependencies in vignettes
√  checking package vignettes in ‘inst/doc’
-  checking running R code from vignettes ...
      ‘ground-control-to-ruimtehol.Rnw’ using ‘UTF-8’ ... OK
    OK
√  checking re-building of vignette outputs (4m 23.4s)
√  checking PDF version of manual
   
─  Done with R CMD check

StarSpace Models in Shiny App

Is there a problem using the starspace_load_model function in Shiny? My app works locally, but when I deploy it I can't read the models...

randomisation

Move the randomisation from the C++ side to the R side so that the package can get on CRAN.

Text Similarity

I am trying to calculate text similarity between sentences. I have a standardized list of medical services, each described by a short text (e.g. consultation of neurologist). Every hospital or clinic comes with its own service list, so I need to map each hospital's service list onto the standardized list. I currently calculate the TF-IDF cosine similarity between each hospital service and the standardized service list, using skip-gram tokens. I have been doing this for a long time, so I also have the correct mapping of services for some 15 hospitals. By 'correct mapping', I mean that medical experts from my organization provided the right mapping for services that the TF-IDF cosine similarity algorithm labelled or mapped wrongly. I would like to treat the 'correct mapping' as a text classification problem, but the number of labels in this case is more than 10K. Is there a way to perform 'supervised text similarity'? I tried the ruimtehol package with trainMode = 3 in the starspace function to calculate similarity, but without success. I get the error "Please check: is the file empty? Do the examples contain proper feature and label according to the trainMode"

See the example of my datasets below (consider A as the 'standardized service list', B as the 'hospital's service list' and C as the 'correct mapping').

A <- data.frame(name= c("Patient had X-ray right leg arteries.",
                         "Subject was administered Rgraphy left shoulder",
                         "Exam consisted of x-ray leg arteries",
                         "Patient administered x-ray leg with 20km distance."),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)
B <- data.frame(name= c("Patient had X-ray left leg arteries",
                         "Rgraphy right shoulder given to patient",
                         "X-ray left shoulder revealed nothing sinister",
                         "Rgraphy right leg arteries tested"), 
                row.names = paste0("B", 1:4), stringsAsFactors = FALSE)

C <- data.frame(name= c("Patient had X-ray right leg arteries.",
                         "Subject was administered Rgraphy left shoulder",
                         "Exam consisted of x-ray leg arteries",
                         "Patient administered x-ray leg with 20km distance."),
                mapping = c("Radiography right leg artery.",
                            "Radiography left shoulder",
                            "Radiography leg arteries",
                            "Radiography leg with more than 10km distance."),
                row.names = paste0("A", 1:4), stringsAsFactors = FALSE)

See the sample code I am using for calculating similarity. It works when trainMode = 0 but not when it is set to 3.

library(ruimtehol)
library(fastrtext)
data(train_sentences, package = "fastrtext")

filename <- tempfile()
writeLines(text = paste(paste0("__label__", train_sentences$class.text),  tolower(train_sentences$text)),
           con = filename)

model <- starspace(file = filename, 
                   trainMode = 0, label = "__label__", 
                   similarity = "dot", verbose = TRUE, initRandSd = 0.01, adagrad = FALSE, 
                   ngrams = 1, lr = 0.01, epoch = 5, thread = 20, dim = 10, negSearchLimit = 5, maxNegSamples = 3)
k =predict(model, "We developed a two-level machine learning approach that in the first level considers two different 
        properties important for protein-protein binding derived from structural models of V3 and V3 sequences.")  

k$prediction[1,]

I am open to suggestions on performing supervised text similarity. Any help would be highly appreciated!
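One thing worth checking is the layout of the training file. The sketch below assumes that trainMode = 3 is used together with fileFormat = "labelDoc" (an argument that appears in another issue on this page) and that this format expects the related texts of one example on a single line, separated by tabs. That is my reading of the StarSpace documentation and has not been verified against this package. The fastText-style file written in the code above has one labelled sentence per line, which would not fit such a layout.

```r
# Expert-validated pairs, shortened from data frame C in the post above.
C <- data.frame(
  name = c("Patient had X-ray right leg arteries.",
           "Subject was administered Rgraphy left shoulder"),
  mapping = c("Radiography right leg artery.",
              "Radiography left shoulder"),
  stringsAsFactors = FALSE
)

# Assumption: one example per line, the two related texts separated by a tab.
lines <- paste(tolower(C$name), tolower(C$mapping), sep = "\t")
filename <- tempfile(fileext = ".txt")
writeLines(lines, con = filename)

# Sanity check: every line should split into exactly two tab-separated fields.
fields_per_line <- lengths(strsplit(readLines(filename), "\t", fixed = TRUE))
print(fields_per_line)

# Not run; a call along these lines is an untested assumption:
# model <- starspace(file = filename, fileFormat = "labelDoc", trainMode = 3)
```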

difficulty in understanding starspace_embedding() behavior

I am trying to replicate the return value of starspace_embedding() function. Here is what I have found so far.

When training a model with ngrams = 1, starspace_embedding(model, 'word1 word2') = as.matrix(model)['word1', ] + as.matrix(model)['word2', ], normalized accordingly. However, this doesn't hold when the model is trained with ngrams > 1.

thanks in advance
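The ngrams = 1 observation can be written down as a small sketch. The matrix below is a made-up stand-in for as.matrix(model), the function name sentence_embedding is hypothetical, and L2 normalisation is an assumption, since the post only says "normalized accordingly". One plausible explanation for the ngrams > 1 case, not verified against the package source, is that n-gram features contribute extra vectors that are not rows keyed by single words, so a sum over word rows alone would miss them.

```r
# Hypothetical dictionary embeddings standing in for as.matrix(model).
embeddings <- rbind(
  word1 = c(3, 0),
  word2 = c(0, 4)
)

# Sum the vectors of the known words, then scale the sum to unit length.
sentence_embedding <- function(embeddings, sentence) {
  tokens <- strsplit(sentence, " ", fixed = TRUE)[[1]]
  tokens <- tokens[tokens %in% rownames(embeddings)]
  total <- colSums(embeddings[tokens, , drop = FALSE])
  total / sqrt(sum(total^2))  # assumption: L2 normalisation
}

result <- sentence_embedding(embeddings, "word1 word2")
print(result)
```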

transfer learning

Give some examples of transfer learning in the vignette; it is currently completely undocumented.

make function as_fasttext, using code from embed_tagspace

Useful if you want to build your own training / test dataset.

as_fasttext <- function(x, y, label = "__label__"){
  if(is.list(y)){
    targets <- sapply(y, FUN=function(x){
      if(length(x) == 0 || all(is.na(x))){
        return(NA_character_)
      }
      paste(paste(label, x, sep = ""), collapse = " ") 
    })
  }else{
    targets <- ifelse(is.na(y), NA_character_, paste(label, y, sep = ""))
  }
  x <- ifelse(is.na(targets), x, paste(targets, x, sep = " "))
  x
}
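A short usage sketch with made-up texts and labels. The function definition is repeated from the proposal above only so that the block runs on its own.

```r
# Definition copied from the proposal above.
as_fasttext <- function(x, y, label = "__label__"){
  if(is.list(y)){
    targets <- sapply(y, FUN=function(x){
      if(length(x) == 0 || all(is.na(x))){
        return(NA_character_)
      }
      paste(paste(label, x, sep = ""), collapse = " ")
    })
  }else{
    targets <- ifelse(is.na(y), NA_character_, paste(label, y, sep = ""))
  }
  x <- ifelse(is.na(targets), x, paste(targets, x, sep = " "))
  x
}

# Hypothetical toy data.
texts <- c("hello world", "good morning")

# One label per text; NA means the text stays unlabelled.
single <- as_fasttext(texts, y = c("greet", NA))
print(single)

# Several labels per text, given as a list; an empty element means no label.
multi <- as_fasttext(texts, y = list(c("a", "b"), character(0)))
print(multi)
```

The resulting character vectors can be written to a file with writeLines() to serve as a training or test set.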

Stack usage Error

Hi,

I installed the latest dev version and ran into an error. I did a clean install and tried to run the tagspace example, but received a similar error message:

Error: C stack usage 17587557196884 is too close to the limit.

On another occasion, I did not get any error message, but the process hung after the first epoch and did not converge. Is there anything I can change in terms of memory or R version? The session info is below. Thanks for helping out.

B.

R session info

R version 3.3.2 (2016-10-31)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: macOS 10.13.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] ruimtehol_0.1 fastrtext_0.2.5

loaded via a namespace (and not attached):
[1] httr_1.3.1 assertthat_0.2.0 R6_2.2.2 tools_3.3.2 withr_2.1.2 curl_3.1 yaml_2.1.15 Rcpp_0.12.18
[9] memoise_1.1.0 codetools_0.2-15 git2r_0.21.0 digest_0.6.13 devtools_1.13.4

C stack info

Cstack_info()
size current direction eval_depth
7969177 16280 1 2

Checkpointing: Continue model training at epoch x after saving intermediate model

Hi, first of all, many thanks for this outstanding package.

I have a question concerning model checkpointing: I have a fairly large corpus (~ 70M words) and run a model which calculates word embeddings (with embed_wordspace) with 10 epochs. I run this on a remote server and it can take up to 2 days for all 10 epochs to finish.

As a fault tolerance measure, I figured it might be a good idea to checkpoint the model after every epoch so in case something crashes, I can load the last saved epoch and continue training from there. For this, I set saveEveryEpoch = TRUE. Since I only want to save the last successful epoch, I keep saveTempModel = FALSE.

My question now is: How can I continue training from this checkpoint after something went wrong? I tried to pass initModel = "wordspace.bin" in the existing embed_wordspace call, which gives:

Start to load a trained starspace model.
STARSPACE-2017-2
Model loaded.

But then it continues to run the model with the parameters specified in the overall call to embed_wordspace, starting at epoch 1 and seemingly ignoring the passed model. Also, when reading in the intermediate wordspace.bin.tsv, I'm left with the default parameters, not the ones I passed to the function. For instance, x$args$param$epoch gives 5 (the default), while I originally passed epoch = 10:

x <- starspace_load_model("wordspace.bin.tsv", method = "tsv-data.table")
x$args$param$epoch
#> [1] 5

Could this be the cause of the problem?

Am I approaching this correctly? What would be an alternative way to achieve my desired goal? I'm thinking of something similar to the ModelCheckpoint functionality in TensorFlow.

Many thanks in advance!

unable to train wikipedia_shuf_train5M.txt

Here's my code and the error. Please help me resolve this issue. I have tried with trainMode=2, 3 and 5

library(ruimtehol)
set.seed(123456789)

setwd("D:/Software/StarSpace/scripts")

model <- starspace(file = "../data/wikipedia_shuf_train5M.txt",  fileFormat = "labelDoc", dim = 512, 
                                trainMode = 3, epoch=20)

Start to initialize starspace model.
Build dict from input file : ../data/wikipedia_shuf_train5M.txt
Read 2099M words
Number of words in dictionary:  10410937
Number of labels in dictionary: 0
Loading data from file : ../data/wikipedia_shuf_train5M.txt
Total number of examples loaded : 0
ERROR: File '../data/wikipedia_shuf_train5M.txt' does not contain any valid example.
Please check: is the file empty? Do the examples contain proper feature and label according to the trainMode? If your examples are unlabeled, try to set trainMode=5.
Error in (function (model = "textspace.bin", save = FALSE, trainFile = "",  : 
  Incorrect Starspace usage
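Since the log reports "Total number of examples loaded : 0", a quick look at the first lines of the file may help. The sketch below rests on an assumption: that fileFormat = "labelDoc" with trainMode = 3 needs at least two tab-separated sentences per line. The temporary file and its contents are made up; with the real data, the path to wikipedia_shuf_train5M.txt would be passed to readLines() instead.

```r
# Hypothetical sample standing in for the first lines of the training file.
sample_file <- tempfile(fileext = ".txt")
writeLines(c("first sentence\tsecond sentence\tthird sentence",
             "a line with only one sentence",
             "sentence a\tsentence b"),
           con = sample_file)

# Read only the first lines so the check stays cheap on a very large file.
head_lines <- readLines(sample_file, n = 1000)

# Count the tab-separated sentences on each line.
sentences_per_line <- lengths(strsplit(head_lines, "\t", fixed = TRUE))

# Assumption: a line is usable only if it holds at least two sentences.
usable <- sentences_per_line >= 2
print(table(usable))
```

If almost no line turns out to be usable, the separator in the file (for example spaces instead of tabs) would be the first thing to look at.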

non-virtual destructor

Running `R CMD build`...
* checking for file ‘/Users/userzerO6r5S/Rtemp/RtmpmR5WYV/remotesfce2e490d99/ruimtehol/DESCRIPTION’ ... OK
* preparing ‘ruimtehol’:
* checking DESCRIPTION meta-information ... OK
* cleaning src
* checking vignette meta-information ... OK
* checking for LF line-endings in source and make files and shell scripts
* checking for empty or unneeded directories
* building ‘ruimtehol_0.3.tar.gz’
Installing package into ‘/Users/userzerO6r5S/R’
(as ‘lib’ is unspecified)
* installing *source* package ‘ruimtehol’ ...
** using staged installation
** libs
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/utils/args.cpp -o Starspace/src/utils/args.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/utils/normalize.cpp -o Starspace/src/utils/normalize.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/utils/utils.cpp -o Starspace/src/utils/utils.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/data.cpp -o Starspace/src/data.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/dict.cpp -o Starspace/src/dict.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/doc_data.cpp -o Starspace/src/doc_data.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/doc_parser.cpp -o Starspace/src/doc_parser.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/model.cpp -o Starspace/src/model.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/parser.cpp -o Starspace/src/parser.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/proj.cpp -o Starspace/src/proj.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c Starspace/src/starspace.cpp -o Starspace/src/starspace.o
In file included from <built-in>:1:
In file included from ./compliance.h:3:
In file included from /Users/userzerO6r5S/R/Rcpp/include/Rcpp.h:27:
In file included from /Users/userzerO6r5S/R/Rcpp/include/RcppCommon.h:29:
In file included from /Users/userzerO6r5S/R/Rcpp/include/Rcpp/r/headers.h:67:
In file included from /Users/userzerO6r5S/R/Rcpp/include/Rcpp/platform/compiler.h:153:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/unordered_map:369:
In file included from /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/__hash_table:16:
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:5: warning: destructor called on non-final 'starspace::DataParser' that has virtual functions but non-virtual destructor [-Wdelete-non-virtual-dtor]
    __data_.second().~_Tp();
    ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3617:9: note: in instantiation of member function 'std::__1::__shared_ptr_emplace<starspace::DataParser, std::__1::allocator<starspace::DataParser> >::__on_zero_shared' requested here
        __shared_ptr_emplace(_Alloc __a, _Args&& ...__args)
        ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4277:26: note: in instantiation of function template specialization 'std::__1::__shared_ptr_emplace<starspace::DataParser, std::__1::allocator<starspace::DataParser> >::__shared_ptr_emplace<std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    ::new(__hold2.get()) _CntrlBlk(__a2, _VSTD::forward<_Args>(__args)...);
                         ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4656:29: note: in instantiation of function template specialization 'std::__1::shared_ptr<starspace::DataParser>::make_shared<std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    return shared_ptr<_Tp>::make_shared(_VSTD::forward<_Args>(__args)...);
                            ^
Starspace/src/starspace.cpp:35:15: note: in instantiation of function template specialization 'std::__1::make_shared<starspace::DataParser, std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    parser_ = make_shared<DataParser>(dict_, args_);
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:23: note: qualify call to silence this warning
    __data_.second().~_Tp();
                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:5: warning: destructor called on non-final 'starspace::LayerDataParser' that has virtual functions but non-virtual destructor [-Wdelete-non-virtual-dtor]
    __data_.second().~_Tp();
    ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3617:9: note: in instantiation of member function 'std::__1::__shared_ptr_emplace<starspace::LayerDataParser, std::__1::allocator<starspace::LayerDataParser> >::__on_zero_shared' requested here
        __shared_ptr_emplace(_Alloc __a, _Args&& ...__args)
        ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4277:26: note: in instantiation of function template specialization 'std::__1::__shared_ptr_emplace<starspace::LayerDataParser, std::__1::allocator<starspace::LayerDataParser> >::__shared_ptr_emplace<std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    ::new(__hold2.get()) _CntrlBlk(__a2, _VSTD::forward<_Args>(__args)...);
                         ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4656:29: note: in instantiation of function template specialization 'std::__1::shared_ptr<starspace::LayerDataParser>::make_shared<std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    return shared_ptr<_Tp>::make_shared(_VSTD::forward<_Args>(__args)...);
                            ^
Starspace/src/starspace.cpp:37:15: note: in instantiation of function template specialization 'std::__1::make_shared<starspace::LayerDataParser, std::__1::shared_ptr<starspace::Dictionary> &, std::__1::shared_ptr<starspace::Args> &>' requested here
    parser_ = make_shared<LayerDataParser>(dict_, args_);
              ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:23: note: qualify call to silence this warning
    __data_.second().~_Tp();
                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:5: warning: destructor called on non-final 'starspace::InternDataHandler' that has virtual functions but non-virtual destructor [-Wdelete-non-virtual-dtor]
    __data_.second().~_Tp();
    ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3617:9: note: in instantiation of member function 'std::__1::__shared_ptr_emplace<starspace::InternDataHandler, std::__1::allocator<starspace::InternDataHandler> >::__on_zero_shared' requested here
        __shared_ptr_emplace(_Alloc __a, _Args&& ...__args)
        ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4277:26: note: in instantiation of function template specialization 'std::__1::__shared_ptr_emplace<starspace::InternDataHandler, std::__1::allocator<starspace::InternDataHandler> >::__shared_ptr_emplace<std::__1::shared_ptr<starspace::Args> &>' requested here
    ::new(__hold2.get()) _CntrlBlk(__a2, _VSTD::forward<_Args>(__args)...);
                         ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4656:29: note: in instantiation of function template specialization 'std::__1::shared_ptr<starspace::InternDataHandler>::make_shared<std::__1::shared_ptr<starspace::Args> &>' requested here
    return shared_ptr<_Tp>::make_shared(_VSTD::forward<_Args>(__args)...);
                            ^
Starspace/src/starspace.cpp:63:12: note: in instantiation of function template specialization 'std::__1::make_shared<starspace::InternDataHandler, std::__1::shared_ptr<starspace::Args> &>' requested here
    return make_shared<InternDataHandler>(args_);
           ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:23: note: qualify call to silence this warning
    __data_.second().~_Tp();
                      ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:5: warning: destructor called on non-final 'starspace::LayerDataHandler' that has virtual functions but non-virtual destructor [-Wdelete-non-virtual-dtor]
    __data_.second().~_Tp();
    ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3617:9: note: in instantiation of member function 'std::__1::__shared_ptr_emplace<starspace::LayerDataHandler, std::__1::allocator<starspace::LayerDataHandler> >::__on_zero_shared' requested here
        __shared_ptr_emplace(_Alloc __a, _Args&& ...__args)
        ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4277:26: note: in instantiation of function template specialization 'std::__1::__shared_ptr_emplace<starspace::LayerDataHandler, std::__1::allocator<starspace::LayerDataHandler> >::__shared_ptr_emplace<std::__1::shared_ptr<starspace::Args> &>' requested here
    ::new(__hold2.get()) _CntrlBlk(__a2, _VSTD::forward<_Args>(__args)...);
                         ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:4656:29: note: in instantiation of function template specialization 'std::__1::shared_ptr<starspace::LayerDataHandler>::make_shared<std::__1::shared_ptr<starspace::Args> &>' requested here
    return shared_ptr<_Tp>::make_shared(_VSTD::forward<_Args>(__args)...);
                            ^
Starspace/src/starspace.cpp:65:12: note: in instantiation of function template specialization 'std::__1::make_shared<starspace::LayerDataHandler, std::__1::shared_ptr<starspace::Args> &>' requested here
    return make_shared<LayerDataHandler>(args_);
           ^
/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/include/c++/v1/memory:3656:23: note: qualify call to silence this warning
    __data_.second().~_Tp();
                      ^
4 warnings generated.
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c rcpp_textspace.cpp -o rcpp_textspace.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c compliance.cpp -o compliance.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -I"/Library/Frameworks/R.framework/Resources/include" -DNDEBUG -pthread -DSTRICT_R_HEADERS -DBOOST_NO_AUTO_PTR -include compliance.h -I./Starspace/src -I'/Users/userzerO6r5S/R/Rcpp/include' -I'/Users/userzerO6r5S/R/BH/include' -I/usr/local/include   -fPIC  -Wall -g -O2  -c RcppExports.cpp -o RcppExports.o
clang++ -mmacosx-version-min=10.13 -std=gnu++11 -dynamiclib -Wl,-headerpad_max_install_names -undefined dynamic_lookup -single_module -multiply_defined suppress -L/Library/Frameworks/R.framework/Resources/lib -L/usr/local/lib -o ruimtehol.so Starspace/src/utils/args.o Starspace/src/utils/normalize.o Starspace/src/utils/utils.o Starspace/src/data.o Starspace/src/dict.o Starspace/src/doc_data.o Starspace/src/doc_parser.o Starspace/src/model.o Starspace/src/parser.o Starspace/src/proj.o Starspace/src/starspace.o rcpp_textspace.o compliance.o RcppExports.o -pthread -F/Library/Frameworks/R.framework/.. -framework R -Wl,-framework -Wl,CoreFoundation
ld: warning: text-based stub file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation.tbd and library file /System/Library/Frameworks//CoreFoundation.framework/CoreFoundation are out of sync. Falling back to library file for linking.
rm -f Starspace/src/utils/args.o Starspace/src/utils/normalize.o Starspace/src/utils/utils.o Starspace/src/data.o Starspace/src/dict.o Starspace/src/doc_data.o Starspace/src/doc_parser.o Starspace/src/model.o Starspace/src/parser.o Starspace/src/proj.o Starspace/src/starspace.o rcpp_textspace.o compliance.o RcppExports.o
installing to /Users/userzerO6r5S/R/00LOCK-ruimtehol/00new/ruimtehol/libs
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** installing vignettes
** testing if installed package can be loaded from temporary location
** checking absolute paths in shared objects and dynamic libraries
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* creating tarball
packaged installation of ‘ruimtehol’ as ‘ruimtehol_0.3.tgz’
* DONE (ruimtehol)

Option of weighting words

Dear Jan,

Many thanks for this outstanding package.

I am working through the second example in the help file for ?embed_sentencespace and have the following question:

When obtaining the sentence similarities, I am wondering if there is a way to weight the words that make up the sentence. For example, in sentence <- "Wat zijn de cijfers qua doorstroming van 2016?", let's say that I would like to emphasize that the most important word for finding similar sentences is 'cijfers'.

Is it possible to assign a weight that tells the algorithm to favour sentences that contain 'cijfers'?

Looking at the package manual, I see that there are some arguments related to weighting, namely wordWeight and useWeight, but I do not know how they should be used.

Any help would be very much appreciated.

Kind regards,

Guillermo
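
One way to emphasize a word outside the model is to build the sentence vector yourself as a weighted average of word vectors. The sketch below uses made-up toy embeddings and a hypothetical helper, weighted_sentence_vector; it does not use wordWeight or useWeight, and it is not necessarily how StarSpace composes sentence embeddings internally. In practice the matrix could come from as.matrix(model).

```r
# Toy word embeddings (made-up numbers, 3 dimensions), one row per word.
word_vectors <- rbind(
  cijfers      = c(0.9, 0.1, 0.0),
  doorstroming = c(0.1, 0.8, 0.1),
  wat          = c(0.3, 0.3, 0.3)
)

# Hypothetical helper: weighted average of the word vectors of a sentence.
weighted_sentence_vector <- function(words, weights, embeddings) {
  stopifnot(length(words) == length(weights),
            all(words %in% rownames(embeddings)))
  # Each row (word) is multiplied by its weight, then the rows are averaged.
  colSums(embeddings[words, , drop = FALSE] * weights) / sum(weights)
}

# Cosine similarity between two numeric vectors.
cosine <- function(a, b) sum(a * b) / sqrt(sum(a^2) * sum(b^2))

words    <- c("wat", "cijfers", "doorstroming")
plain    <- weighted_sentence_vector(words, c(1, 1, 1), word_vectors)
weighted <- weighted_sentence_vector(words, c(1, 5, 1), word_vectors)

# With a weight of 5 on 'cijfers' the sentence vector moves towards that word.
cosine(plain,    word_vectors["cijfers", ])
cosine(weighted, word_vectors["cijfers", ])
```

The weighted vector can then be compared against candidate sentence vectors in the same way as an unweighted one.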

Problems with running ruimtehol on Windows R

Hi! I am using ruimtehol on a Mac and am happy with the result. Thank you! But then I tried to run the same script with the same training and label datasets on Windows, and something very strange happened: the result on the PC is completely different, and the model looks completely untrained. The R version is the same (4.0.3), and all packages, such as UDPipe, are updated to their latest versions. What could be the problem? (I have tried 2 different Macs and 2 PCs.)

Predict method returning duplicate results

Thanks for this package. It has really helped me integrate StarSpace into my workflow. My question: when I run predict on a model (trained with trainingMode = 1), I get a nice data frame with possible labels for my dataset. The data frame, however, contains duplicate results (e.g. the same labels with the same probabilities). Is this intended or due to StarSpace, or is it an implementation feature or bug? Best, Bob

Prediction of next word in a sentence

I was wondering whether it would be possible to build a word-prediction model with ruimtehol, that is, a model that predicts the most likely next word given a sequence of words (part of a sentence).

I was thinking of the label prediction algorithm (tagspace, if I am correct). But then we would have to feed the model all possible parts of a sentence as text and every next word as a label. I am not sure if that's the way to go. Is there an easier way?

Many thanks in advance!
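
The idea of feeding sentence prefixes as text and the next word as label can be sketched in base R. The helper make_next_word_pairs below is hypothetical and only builds the training pairs; whether a tagspace model trained on such pairs gives useful next-word predictions is not tested here.

```r
# Hypothetical helper: turn one sentence into (prefix, next word) pairs.
make_next_word_pairs <- function(sentence) {
  tokens <- strsplit(tolower(sentence), "[[:space:]]+")[[1]]
  tokens <- tokens[nchar(tokens) > 0]
  if (length(tokens) < 2) {
    # A single word has no "next word", so return an empty result.
    return(data.frame(text = character(0), label = character(0),
                      stringsAsFactors = FALSE))
  }
  idx <- seq_len(length(tokens) - 1)
  data.frame(
    # Prefix made of the first i tokens
    text  = vapply(idx, function(i) paste(tokens[seq_len(i)], collapse = " "),
                   character(1)),
    # The token that follows that prefix
    label = tokens[idx + 1],
    stringsAsFactors = FALSE
  )
}

pairs <- make_next_word_pairs("the cat sat on the mat")
pairs
```

The text and label columns could then presumably be passed as the text and labels of embed_tagspace. Note that the number of pairs grows with the number of tokens in the corpus, and every distinct word becomes a label.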

embed_tagspace produces different results within a session and when loaded (starspace_load_model) if ngrams is used

This strange behaviour showed up in my project, but I have also reproduced it with the dekamer example. Everything is fine until the model (embed_tagspace) is trained with the ngrams parameter specified, then saved and reloaded, no matter which method I use.
I noticed inconsistencies in the predict results: the similarities are on a different scale and the ranking of the scored labels is different.
Note that the method used to save and load the trained model does affect the predict results, but the results are never consistent with those of the model object trained in the session.

Running ruimtehol on R server

When I try to load a trained ruimtehol model on a remote server (where I run R), the following error message appears:

Error in (function (model = "textspace.bin", save = FALSE, trainFile = "", : std::bad_alloc

Any ideas about what I could be doing wrong? Thanks!

Request for sentiment scoring example

Hello,
It would be very helpful if you could provide an example of sentiment scoring.
I have to create a dictionary for sentiment analysis using the txt_sentiment function.
Thank you

semi-supervised learning

The embed_... wrappers need to handle missing data in labels more gracefully, so that semi-supervised learning becomes easier.

Word embeddings

Hi Jan, just a quick question. It may be too basic, but I'd like to be sure of the answer. It is about word embeddings and the typical analogy example london = paris - france + uk + england.

With your package, would the following be the right approach (assuming that x contains the data)?

set.seed(123)
model <- embed_wordspace(x, early_stopping = 0.9, dim = 15, ws = 7, epoch = 10, minCount = 1, ngrams = 1)
plot(model)
word_vectors <- as.matrix(model)

mostsimilar <- embedding_similarity(word_vectors, word_vectors["paris", ] - word_vectors["france", ] + word_vectors["uk", ] + word_vectors["england", ])
head(sort(mostsimilar[, 1], decreasing = TRUE), 10)
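
The arithmetic behind such an analogy query can be checked on a tiny example in base R. The vectors below are made up so that the analogy holds by construction; they are not output of embed_wordspace, and real embeddings will be much noisier.

```r
# Toy 3-dimensional embeddings (made-up numbers), one row per word.
toy <- rbind(
  paris   = c(1.0, 0.0, 1.0),
  france  = c(1.0, 0.0, 0.0),
  england = c(0.0, 1.0, 0.0),
  london  = c(0.0, 1.0, 1.0),
  berlin  = c(0.5, 0.5, 0.2)
)

# Analogy query: subtract the source country, add the target country.
query <- toy["paris", ] - toy["france", ] + toy["england", ]

# Cosine similarity of every word vector to the query vector.
sims <- drop(toy %*% query) / (sqrt(rowSums(toy^2)) * sqrt(sum(query^2)))

# Drop the words used to build the query, then take the closest remaining word.
sims <- sims[!names(sims) %in% c("paris", "france", "england")]
closest <- names(which.max(sims))
closest
```

Removing the query words before ranking matters, because they are often among the nearest neighbours of the combined vector.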

Sentence separator for labelDoc format

Hi Jan,

I've been using ruimtehol for a while now, specifically the articlespace embedding functionality. However, I recently noticed that the embed_articlespace function produces an empty term in the dictionary of the StarSpace model.

Reproducible example:

library(ruimtehol)
library(udpipe)

data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "nl")
x$token <- x$lemma
x <- x[, c("doc_id", "sentence_id", "token")]
set.seed(123456789)
model <- embed_articlespace(x, early_stopping = 1,
                            dim = 25, epoch = 25, minCount = 2,
                            negSearchLimit = 1, maxNegSamples = 2)
dict <- starspace_dictionary(model)
# Empty string in dictionary?
dict$dictionary[1, ]

>>  term is_word is_label
>>1         TRUE    FALSE

When I change the sentence separator to \t (instead of <space>\t<space>), the empty term is not in the dictionary anymore.

Hence, should the sentence separator for the labelDoc format be surrounded with spaces or not?

Is it possible to exclude similarity of e.g. sentences when predicting?

Ruimtehol works like a charm. I use it to find similar articles based on words or sentences as input in the predict function.

I was wondering whether it is possible, or could be made possible, to take the dissimilarity to certain words into account as well. For example, to find articles that are close to word1 while being far away from word2?
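
One possible workaround outside the predict function is to score each article by its similarity to the wanted word minus its similarity to the unwanted word. The sketch below uses made-up 2-dimensional vectors and a hypothetical helper, cosine_to_rows; in practice the vectors would have to come from the trained model, and the scoring rule itself is an assumption rather than a feature of the package.

```r
# Toy article and word embeddings (made-up numbers), one row per item.
doc_vectors <- rbind(
  doc_a = c(0.9, 0.1),
  doc_b = c(0.7, 0.7),
  doc_c = c(0.1, 0.9)
)
word_vectors <- rbind(
  word1 = c(1, 0),
  word2 = c(0, 1)
)

# Hypothetical helper: cosine similarity of each row of m to the vector v.
cosine_to_rows <- function(m, v) {
  drop(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# High score: close to word1 and far from word2.
score <- cosine_to_rows(doc_vectors, word_vectors["word1", ]) -
         cosine_to_rows(doc_vectors, word_vectors["word2", ])
ranking <- names(sort(score, decreasing = TRUE))
ranking
```

The two terms could also be weighted differently if the dissimilarity should count for more or less than the similarity.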
