nametagger's People

Contributors

jwijffels, skvrnami

Forkers

skvrnami

nametagger's Issues

Update README with an example

I cannot find a working example for the nametagger function. Could you update the README with an example that can be run as-is?
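
A minimal sketch of the kind of example being requested, assembled from the package's own documented examples as they appear in the check log further down this page. It assumes the nametagger package is installed and that the bundled model file models/exampletagger.ner ships with it, as that log indicates; it is not the maintainers' README text.

```r
library(nametagger)

# Locate the small example model that the check log shows being loaded
# from the installed package directory.
path  <- system.file(package = "nametagger", "models", "exampletagger.ner")
model <- nametagger_load_model(path)

# Two short Dutch sentences, copied verbatim from the predict.nametagger example.
x <- c("I ga naar Brussel op reis.", "Goed zo dat zal je deugd doen Karel")

# Tokenise on whitespace and punctuation, then tag each token.
entities <- predict(model, x, split = "[[:space:][:punct:]]+")
print(entities)
```

In the logged run, this call returned one row per token and tagged "Brussel" as B-LOC and "Karel" as B-PER; results with a different model or package version may differ.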

valgrind issue with ufal::nametag::utils::lzma::MatchFinder_Create

==1458363== Memcheck, a memory error detector
==1458363== Copyright (C) 2002-2022, and GNU GPL'd, by Julian Seward et al.
==1458363== Using Valgrind-3.21.0 and LibVEX; rerun with -h for copyright info
==1458363== Command: /data/blackswan/ripley/R/R-devel-vg/bin/exec/R --vanilla
==1458363== 

R Under development (unstable) (2023-08-14 r84947) -- "Unsuffered Consequences"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> pkgname <- "nametagger"
> source(file.path(R.home("share"), "R", "examples-header.R"))
> options(warn = 1)
> library('nametagger')
> 
> base::assign(".oldSearch", base::search(), pos = 'CheckExEnv')
> base::assign(".old_wd", base::getwd(), pos = 'CheckExEnv')
> cleanEx()
> nameEx("europeana_read")
> ### * europeana_read
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: europeana_read
> ### Title: Read Europeana Newspaper data
> ### Aliases: europeana_read
> 
> ### ** Examples
> 
> ## Don't show: 
> if(require(udpipe)){
+ ## End(Don't show)
+ library(udpipe)
+ r <- "https://raw.githubusercontent.com/EuropeanaNewspapers/ner-corpora/master"
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: udpipe
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’

> nameEx("europeananews")
> ### * europeananews
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: europeananews
> ### Title: Tagged news paper articles from Europeana
> ### Aliases: europeananews
> 
> ### ** Examples
> 
> data(europeananews)
> str(europeananews)
'data.frame':	533893 obs. of  4 variables:
 $ doc_id     : chr  "enp_NL.kb.bio" "enp_NL.kb.bio" "enp_NL.kb.bio" "enp_NL.kb.bio" ...
 $ sentence_id: int  1 1 1 1 1 1 1 1 1 1 ...
 $ token      : chr  "Indien" "men" "Italië" "in" ...
 $ entity     : chr  "O" "O" "O" "O" ...
> 
> 
> 
> cleanEx()
> nameEx("nametagger")
> ### * nametagger
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: nametagger
> ### Title: Train a Named Entity Recognition Model using NameTag
> ### Aliases: nametagger
> 
> ### ** Examples
> 
> data(europeananews)
> x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
> traindata <- subset(x, sentence_id >  100)
> testdata  <- subset(x, sentence_id <= 100)
> path <- "nametagger-nl.ner" 
> ## Don't show: 
> path <- tempfile("nametagger-nl_", fileext = ".ner")
> traindata <- subset(x, sentence_id >  100 & sentence_id < 300)
> testdata  <- subset(x, sentence_id <= 100)
> ## End(Don't show) 
> opts <- nametagger_options(file = path,
+                            token = list(window = 2),
+                            token_normalisedsuffix = list(window = 0, from = 1, to = 4),
+                            ner_previous = list(window = 2),
+                            time = list(use = TRUE),
+                            url_email = list(url = "URL", email = "EMAIL"))
> ## Don't show: 
> model <- nametagger(x.train = traindata, x.test = testdata,
+                     iter = 1, lambda = 0.5, control = opts)
==1458363== Conditional jump or move depends on uninitialised value(s)
==1458363==    at 0x17E945C7: ufal::nametag::utils::lzma::MatchFinder_Create(ufal::nametag::utils::lzma::CMatchFinder*, unsigned int, unsigned int, unsigned int, unsigned int, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:540)
==1458363==    by 0x17E96768: LzmaEnc_Alloc (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:2984)
==1458363==    by 0x17E96768: ufal::nametag::utils::lzma::LzmaEnc_AllocAndInit(ufal::nametag::utils::lzma::CLzmaEnc*, unsigned int, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) [clone .constprop.0] (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3075)
==1458363==    by 0x17E96C1C: ufal::nametag::utils::lzma::LzmaEnc_MemEncode(void*, unsigned char*, unsigned long*, unsigned char const*, unsigned long, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3269)
==1458363==    by 0x17E96D00: ufal::nametag::utils::lzma::LzmaEncode(unsigned char*, unsigned long*, unsigned char const*, unsigned long, ufal::nametag::utils::lzma::CLzmaEncProps const*, unsigned char*, unsigned long*, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3293)
==1458363==    by 0x17E96DC3: ufal::nametag::utils::compressor::save(std::ostream&, ufal::nametag::utils::binary_encoder const&) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3320)
==1458363==    by 0x17E87DC6: ufal::nametag::entity_map::save(std::ostream&) const (packages/tests-vg/nametagger/src/nametag/src/ner/entity_map_encoder.cpp:24)
==1458363==    by 0x17E85846: ufal::nametag::bilou_ner_trainer::train(ufal::nametag::ner_ids::ner_id, int, ufal::nametag::network_parameters const&, ufal::nametag::tagger const&, std::istream&, std::istream&, std::istream&, std::ostream&) (packages/tests-vg/nametagger/src/nametag/src/ner/bilou_ner_trainer.cpp:71)
==1458363==    by 0x17E99241: nametag_train(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, double, double, double, double, int, bool, char const*) (packages/tests-vg/nametagger/src/rcpp_nametag.cpp:189)
==1458363==    by 0x17EA3C06: _nametagger_nametag_train (packages/tests-vg/nametagger/src/RcppExports.cpp:63)
==1458363==    by 0x4A3B59: R_doDotCall (svn/R-devel/src/main/dotcode.c:927)
==1458363==    by 0x4A4203: do_dotcall (svn/R-devel/src/main/dotcode.c:1551)
==1458363==    by 0x4DD026: bcEval (svn/R-devel/src/main/eval.c:7567)
==1458363==  Uninitialised value was created by a heap allocation
==1458363==    at 0x48432F3: operator new[](unsigned long) (/builddir/build/BUILD/valgrind-3.21.0/coregrind/m_replacemalloc/vg_replace_malloc.c:714)
==1458363==    by 0x17E961C7: ufal::nametag::utils::lzma::LzmaEnc_Create(ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:2769)
==1458363==    by 0x17E96C7B: ufal::nametag::utils::lzma::LzmaEncode(unsigned char*, unsigned long*, unsigned char const*, unsigned long, ufal::nametag::utils::lzma::CLzmaEncProps const*, unsigned char*, unsigned long*, int, ufal::nametag::utils::lzma::ICompressProgress*, ufal::nametag::utils::lzma::ISzAlloc*, ufal::nametag::utils::lzma::ISzAlloc*) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3283)
==1458363==    by 0x17E96DC3: ufal::nametag::utils::compressor::save(std::ostream&, ufal::nametag::utils::binary_encoder const&) (packages/tests-vg/nametagger/src/nametag/src/utils/compressor_save.cpp:3320)
==1458363==    by 0x17E87DC6: ufal::nametag::entity_map::save(std::ostream&) const (packages/tests-vg/nametagger/src/nametag/src/ner/entity_map_encoder.cpp:24)
==1458363==    by 0x17E85846: ufal::nametag::bilou_ner_trainer::train(ufal::nametag::ner_ids::ner_id, int, ufal::nametag::network_parameters const&, ufal::nametag::tagger const&, std::istream&, std::istream&, std::istream&, std::ostream&) (packages/tests-vg/nametagger/src/nametag/src/ner/bilou_ner_trainer.cpp:71)
==1458363==    by 0x17E99241: nametag_train(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, int, int, double, double, double, double, int, bool, char const*) (packages/tests-vg/nametagger/src/rcpp_nametag.cpp:189)
==1458363==    by 0x17EA3C06: _nametagger_nametag_train (packages/tests-vg/nametagger/src/RcppExports.cpp:63)
==1458363==    by 0x4A3B59: R_doDotCall (svn/R-devel/src/main/dotcode.c:927)
==1458363==    by 0x4A4203: do_dotcall (svn/R-devel/src/main/dotcode.c:1551)
==1458363==    by 0x4DD026: bcEval (svn/R-devel/src/main/eval.c:7567)
==1458363==    by 0x4F595F: Rf_eval (svn/R-devel/src/main/eval.c:1146)
==1458363== 
> ## End(Don't show)
> model
Nametag model saved at /tmp/RtmpC5CeFp/nametagger-nl_1640bb26cca642.ner
  size of the model in Mb: 0.05
  number of categories: 4
  category labels: LOC, ORG, URL, EMAIL
> model$stats
$iteration
[1] 1

$lr
[1] 0.1

$logprob
[1] -140.94

$accuracy_train
[1] 99.65

$accuracy_test
[1] 99.96

> plot(model$stats$iteration, model$stats$logprob, type = "b")
> plot(model$stats$iteration, model$stats$accuracy_train, type = "b", ylim = c(95, 100))
> lines(model$stats$iteration, model$stats$accuracy_test, type = "b", lty = 2, col = "red")
> ## Don't show: 
> if(require(udpipe)){
+ ## End(Don't show)
+ predict(model, 
+         "Ik heet Karel je kan me bereiken op [email protected] of www.duchanel.be", 
+         split = "[[:space:]]+")
+ ## Don't show: 
+ } # End of main if statement running only if the required packages are installed
Loading required package: udpipe
   doc_id sentence_id term_id             term  entity
1       1           1       1               Ik       O
2       1           1       1             heet       O
3       1           1       1            Karel       O
4       1           1       1               je       O
5       1           1       1              kan       O
6       1           1       1               me       O
7       1           1       1         bereiken       O
8       1           1       1               op       O
9       1           1       1 [email protected] B-EMAIL
10      1           1       1               of       O
11      1           1       1  www.duchanel.be   B-URL
> ## End(Don't show)
> 
> features <- system.file(package = "nametagger", 
+                         "models", "features_default.txt")
> cat(readLines(features), sep = "\n")
# Sentence processors
Form/2
Lemma/2
RawLemma/2
RawLemmaCapitalization/2
Tag/2
NumericTimeValue/1
> path_traindata <- "traindata.txt" 
> ## Don't show: 
> path_traindata <- tempfile("traindata_", fileext = ".txt")
> ## End(Don't show)
> write_nametagger(x, file = path_traindata)
> ## Don't show: 
> model <- nametagger(path_traindata, iter = 1, control = features, file = path)
> ## End(Don't show)
> 
> ## Don't show: 
> # clean up for CRAN
> file.remove(path)
[1] TRUE
> file.remove(path_traindata)
[1] TRUE
> ## End(Don't show)
> 
> 
> 
> cleanEx()

detaching ‘package:udpipe’

> nameEx("nametagger_download_model")
> ### * nametagger_download_model
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: nametagger_download_model
> ### Title: Download a Nametag model
> ### Aliases: nametagger_download_model
> 
> ### ** Examples
> 
> 
> 
> 
> cleanEx()
> nameEx("nametagger_load_model")
> ### * nametagger_load_model
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: nametagger_load_model
> ### Title: Load a Named Entity Recognition
> ### Aliases: nametagger_load_model
> 
> ### ** Examples
> 
> path  <- system.file(package = "nametagger", "models", "exampletagger.ner")
> model <- nametagger_load_model(path)
> model
Nametag model saved at /data/blackswan/ripley/R/packages/tests-vg/nametagger.Rcheck/nametagger/models/exampletagger.ner
  size of the model in Mb: 1.11
  number of categories: 3
  category labels: LOC, ORG, PER
> 
> 
> 
> cleanEx()
> nameEx("nametagger_options")
> ### * nametagger_options
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: nametagger_options
> ### Title: Define text transformations serving as predictive elements in
> ###   the nametagger model
> ### Aliases: nametagger_options
> 
> ### ** Examples
> 
> opts <- nametagger_options(token = list(window = 2))
> opts
## file: nametagger.ner
## type: generic
## tagger: trivial
Form/2 
> opts <- nametagger_options(time = list(use = TRUE, window = 3),
+                            token_capitalised = list(use = TRUE, window = 1),
+                            ner_previous = list(use = TRUE, window = 5))
> opts                            
## file: nametagger.ner
## type: generic
## tagger: trivial
FormCapitalization/1 
NumericTimeValue/3 
PreviousStage/5 
> opts <- nametagger_options(
+   lemma_capitalised = list(window = 3),
+   brown = list(window = 1, file = "path/to/brown/clusters/file.txt"),
+   gazetteers = list(window = 1, 
+                     file_loc = "path/to/txt/file1.txt", 
+                     file_time = "path/to/txt/file2.txt"))
> opts
## file: nametagger.ner
## type: generic
## tagger: trivial
RawLemmaCapitalization/3 
BrownClusters/1 path/to/brown/clusters/file.txt
Gazetteers/1 path/to/txt/file1.txt path/to/txt/file2.txt
> opts <- nametagger_options(
+   lemma_capitalised = list(window = 3),
+   brown = list(window = 2, 
+                file = "path/to/brown/clusters/file.txt"),
+   gazetteers_enhanced = list(
+     loc  = "LOC",  type_loc  = "form", save_loc  = "embed_in_model", file_loc  = "pathto/loc.txt",  
+     time = "TIME", type_time = "form", save_time = "embed_in_model", file_time = "pathto/time.txt")
+     )
> opts
## file: nametagger.ner
## type: generic
## tagger: trivial
RawLemmaCapitalization/3 
BrownClusters/2 path/to/brown/clusters/file.txt
GazetteersEnhanced LOC form embed_in_model pathto/loc.txt TIME form embed_in_model pathto/time.txt
> 
> 
> 
> cleanEx()
> nameEx("predict.nametagger")
> ### * predict.nametagger
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: predict.nametagger
> ### Title: Perform Named Entity Recognition on tokenised text
> ### Aliases: predict.nametagger
> 
> ### ** Examples
> 
> path  <- system.file(package = "nametagger", "models", "exampletagger.ner")
> model <- nametagger_load_model(path)
> model
Nametag model saved at /data/blackswan/ripley/R/packages/tests-vg/nametagger.Rcheck/nametagger/models/exampletagger.ner
  size of the model in Mb: 1.11
  number of categories: 3
  category labels: LOC, ORG, PER
> 
> x <- c("I ga naar Brussel op reis.", "Goed zo dat zal je deugd doen Karel")
> entities <- predict(model, x, split = "[[:space:][:punct:]]+")                          
> entities
   doc_id sentence_id term_id    term entity
1       1           1       1       I      O
2       1           1       1      ga      O
3       1           1       1    naar      O
4       1           1       1 Brussel  B-LOC
5       1           1       1      op      O
6       1           1       1    reis      O
7       2           2       1    Goed      O
8       2           2       1      zo      O
9       2           2       1     dat      O
10      2           2       1     zal      O
11      2           2       1      je      O
12      2           2       1   deugd      O
13      2           2       1    doen      O
14      2           2       1   Karel  B-PER
> 
> 
> 
> 
> cleanEx()
> nameEx("write_nametagger")
> ### * write_nametagger
> 
> flush(stderr()); flush(stdout())
> 
> ### Name: write_nametagger
> ### Title: Save a tokenised dataset as nametagger train data
> ### Aliases: write_nametagger
> 
> ### ** Examples
> 
> data(europeananews)
> x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
> x <- head(x, n = 250)
> 
> path <- "traindata.txt" 
> ## Don't show: 
> path <- tempfile("traindata_", fileext = ".txt")
> ## End(Don't show)
> bio  <- write_nametagger(x, file = path)
> str(bio)
List of 2
 $ data: Named chr [1:10] "Indien\tO\nmen\tO\nItalië\tO\nin\tO\nzijn\tO\ngeheel\tO\nkon\tO\nneutraliseren\tO\n,\tO\ndan\tO\nzou\tO\ndit\tO"| __truncated__ "Op\tO\nde\tO\nwerven\tO\nte\tO\nChatham\tO\nheerscht\tO\ntegenwoordig\tO\nde\tO\nmeeste\tO\nbedrijvigheid\tO\ne"| __truncated__ "Er\tO\nzijn\tO\nvoor\tO\nhet\tO\noogenblik\tO\nte\tO\nChatham\tO\ndrie\tO\nlinieschepon.\tO\ntwee\tO\nfregatten"| __truncated__ "Ten\tO\neinde\tO\nde\tO\nwerkzaamhe\tO\nlen\tO\nte\tO\nbespoedigen,\tO\nzijn\tO\nonlangs\tO\nge\tO\nvangenen\tO"| __truncated__ ...
  ..- attr(*, "names")= chr [1:10] "enp_NL.kb.bio.1" "enp_NL.kb.bio.2" "enp_NL.kb.bio.3" "enp_NL.kb.bio.4" ...
 $ file: chr "/tmp/RtmpC5CeFp/traindata_1640bb6cce5482.txt"
 - attr(*, "class")= chr "nametagger_traindata"
> 
> ## Don't show: 
> # clean up for CRAN
> file.remove(path)
[1] TRUE
> ## End(Don't show)
> 
> 
> 
> ### * <FOOTER>
> ###
> cleanEx()
> options(digits = 7L)
> base::cat("Time elapsed: ", proc.time() - base::get("ptime", pos = 'CheckExEnv'),"\n")
Time elapsed:  543.503 5.023 552.115 0.01 0.085 
> grDevices::dev.off()
null device 
          1 
> ###
> ### Local variables: ***
> ### mode: outline-minor ***
> ### outline-regexp: "\\(> \\)?### [*]+" ***
> ### End: ***
> quit('no')
==1458363== 
==1458363== HEAP SUMMARY:
==1458363==     in use at exit: 264,371,985 bytes in 37,782 blocks
==1458363==   total heap usage: 7,156,226 allocs, 7,118,444 frees, 2,229,935,681 bytes allocated
==1458363== 
==1458363== LEAK SUMMARY:
==1458363==    definitely lost: 0 bytes in 0 blocks
==1458363==    indirectly lost: 0 bytes in 0 blocks
==1458363==      possibly lost: 0 bytes in 0 blocks
==1458363==    still reachable: 264,371,985 bytes in 37,782 blocks
==1458363==         suppressed: 0 bytes in 0 blocks
==1458363== Reachable blocks (those to which a pointer was found) are not shown.
==1458363== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==1458363== 
==1458363== For lists of detected and suppressed errors, rerun with: -s
==1458363== ERROR SUMMARY: 6 errors from 1 contexts (suppressed: 0 from 0)
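
The report above is triggered while nametagger() saves the trained model, in the bundled LZMA compressor. A reduced script that exercises only that training call may help when reproducing it. The sketch below copies the calls from the logged example; the choice to write the model to a temporary file and the idea of running it as a standalone script are assumptions, not part of the original report.

```r
library(nametagger)

# Same subset of the bundled data that the logged example trains on.
data(europeananews)
x <- subset(europeananews, doc_id %in% "enp_NL.kb.bio")
traindata <- subset(x, sentence_id > 100 & sentence_id < 300)
testdata  <- subset(x, sentence_id <= 100)

# Write the model to a temporary file, as the logged run does.
path <- tempfile("nametagger-nl_", fileext = ".ner")
opts <- nametagger_options(file = path,
                           token = list(window = 2),
                           token_normalisedsuffix = list(window = 0, from = 1, to = 4),
                           ner_previous = list(window = 2),
                           time = list(use = TRUE),
                           url_email = list(url = "URL", email = "EMAIL"))

# One training iteration is enough to reach the model-saving step
# where valgrind reported the uninitialised value.
model <- nametagger(x.train = traindata, x.test = testdata,
                    iter = 1, lambda = 0.5, control = opts)
print(model$stats)

# Clean up the temporary model file.
file.remove(path)
```

Saved under a hypothetical name such as reproduce.R, the script could be run with something like R -d "valgrind --track-origins=yes" --vanilla -f reproduce.R, assuming valgrind is installed and the package was built with debug symbols.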
