A plain ‘Rcpp’ wrapper of ‘MeCab’
Home Page: https://paithiov909.github.io/gibasa/
License: GNU General Public License v3.0
Quit using separate. Instead of tidyr::separate, use readr::read_csv for parsing feature strings.
Example:
col_select <- c(1,4,5)
into <- gibasa::get_dict_features()
res <- gibasa::gbs_tokenize("この木なんの木気になる木")
cols <-
dplyr::pull(res, feature) |>
paste0(collapse = "\n") |>
I() |>
readr::read_csv(col_names = FALSE, col_select = tidyselect::all_of(col_select), na = "*", progress = FALSE, show_col_types = FALSE)
into <- purrr::set_names(into, into)
colnames(cols) <- unname(into[col_select])
dplyr::bind_cols(dplyr::select(res, !.data$feature), cols)
#>    doc_id sentence_id token_id token   POS1 POS4 X5StageUse1
#> 1       1           1        1  この 連体詞   NA        <NA>
#> 2       1           1        2    木   名詞   NA        <NA>
#> 3       1           1        3    な 助動詞   NA    特殊・ダ
#> 4       1           1        4    ん   名詞   NA        <NA>
#> 5       1           1        5    の   助詞   NA        <NA>
#> 6       1           1        6    木   名詞   NA        <NA>
#> 7       1           1        7    気   名詞   NA        <NA>
#> 8       1           1        8    に   助詞   NA        <NA>
#> 9       1           1        9  なる   動詞   NA  五段・ラ行
#> 10      1           1       10    木   名詞   NA        <NA>
Created on 2022-04-08 by the reprex package (v2.0.1)
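The same paste-then-parse trick also works without readr, via the text argument of base R's read.csv; a standalone sketch with made-up feature strings and column names (for illustration only):

```r
# Hypothetical IPA-dictionary-style feature strings (made up for this sketch)
feature <- c("名詞,一般,*,*", "動詞,自立,*,五段・ラ行")

# Collapse the strings into one CSV payload and parse it in a single pass,
# treating "*" as NA, analogous to the readr::read_csv call
cols <- read.csv(
  text = paste0(feature, collapse = "\n"),
  header = FALSE,
  na.strings = "*",
  colClasses = "character"
)
colnames(cols) <- c("POS1", "POS2", "POS3", "X5StageUse1")
```

readr::read_csv is still preferable for large inputs, but the base variant shows that the core idea is simply "one parse over a collapsed string" rather than anything readr-specific.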
This issue lists Renovate updates and detected dependencies. Read the Dependency Dashboard docs to learn more.
This repository currently has no open or pending branches.
.github/workflows/R-CMD-check.yaml
 - actions/checkout v4
 - r-lib/actions v2
 - r-lib/actions v2
 - actions/setup-python v5
 - r-lib/actions v2
 - r-lib/actions v2
.github/workflows/pkgdown.yaml
 - actions/checkout v4
 - r-lib/actions v2
 - r-lib/actions v2
 - r-lib/actions v2
.github/workflows/rhub.yaml
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
 - r-hub/rhub2 v1
Support tf="itf" and idf="df" in bind_tf_idf2.
DF-ITF for a term t in document d is calculated as:

tf_idf(t, d) = log(|d| / n(t, d)) * (n_t / N)

where n(t, d) is the count of t in d, |d| is the total term count of d, n_t is the number of documents containing t, and N is the total number of documents. It can be implemented as follows:
global_df <- function(sp) {
purrr::set_names(count_nnzero(sp) / nrow(sp), colnames(sp))
}
bind_tf_idf2 <- function(tbl,
term = "token",
document = "doc_id",
n = "n",
tf = c("tf", "tf2", "tf3", "itf"),
idf = c("idf", "idf2", "idf3", "idf4", "df"),
norm = FALSE,
rmecab_compat = TRUE) {
tf <- rlang::arg_match(tf)
idf <- rlang::arg_match(idf)
term <- enquo(term)
document <- enquo(document)
n_col <- enquo(n)
tbl <- dplyr::ungroup(tbl)
terms <- as.character(dplyr::pull(tbl, {{ term }}))
documents <- as.character(dplyr::pull(tbl, {{ document }}))
n <- dplyr::pull(tbl, {{ n_col }})
doc_totals <- tapply(
n, documents,
function(x) {
switch(tf,
tf = sum(x),
tf2 = log(x + 1),
tf3 = booled_freq(x),
itf = sum(x)
)
}
)
if (identical(tf, "tf")) {
tbl <- dplyr::mutate(tbl, tf = .data$n / as.numeric(doc_totals[documents]))
} else if (identical(tf, "itf")) {
tbl <- dplyr::mutate(tbl, tf = log(as.numeric(doc_totals[documents]) / .data$n))
} else {
tbl <- dplyr::mutate(tbl, tf = purrr::flatten_dbl(doc_totals))
}
if (isTRUE(rmecab_compat)) {
sp <- cast_sparse(tbl, !!document, !!term, "tf")
} else {
sp <- cast_sparse(tbl, !!document, !!term, !!n_col)
}
if (isTRUE(norm)) {
sp <- Matrix::t(Matrix::t(sp) * (1 / sqrt(Matrix::rowSums((sp * sp)))))
}
idf <- switch(idf,
idf = global_idf(sp),
idf2 = global_idf2(sp),
idf3 = global_idf3(sp),
idf4 = global_entropy(sp),
df = global_df(sp)
)
tbl <- dplyr::mutate(tbl,
idf = as.numeric(idf[terms]),
tf_idf = .data$tf * .data$idf
)
tbl
}
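To make the weighting concrete, here is a minimal base-R check of the DF-ITF pieces on a tiny made-up count matrix (no gibasa involved; colSums(m > 0) is assumed to match the behavior of count_nnzero):

```r
# Toy document-term counts: rows are documents, columns are terms
m <- matrix(
  c(2, 1, 0,
    1, 0, 3),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("d1", "d2"), c("a", "b", "c"))
)

# ITF term weight: log(total terms in the document / term count)
itf_a_d1 <- log(sum(m["d1", ]) / m["d1", "a"])  # log(3 / 2)

# Global DF: share of documents in which each term appears
df <- colSums(m > 0) / nrow(m)  # a: 1, b: 0.5, c: 0.5

# DF-ITF weight for term "a" in document "d1"
w <- itf_a_d1 * df[["a"]]
```

Note that, unlike IDF, DF grows with how widespread a term is, so DF-ITF favors common terms that are nonetheless rare within a given document.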
audubon needs V8, which is broken for some platforms.

tokenize (stringi::stri_split_boundaries; line 13 in b5d4074) will fail when all input sentences are blank.
A simple check before tokenizing prevents this behavior, but also kills performance...
if (all(is_blank(x))) {
rlang::abort("All elements of `x` are blank.")
}
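If the all(is_blank(x)) scan is too expensive, a guard that exits at the first non-blank element avoids testing the whole vector in the common case; a sketch with a stand-in is_blank (the real helper is internal, so this definition is an assumption):

```r
# Stand-in for the internal blank test (assumption)
is_blank <- function(x) !nzchar(trimws(x))

# Returns TRUE as soon as one non-blank element is found,
# instead of evaluating is_blank() over the entire vector first
has_content <- function(x) {
  for (el in x) {
    if (!is_blank(el)) {
      return(TRUE)
    }
  }
  FALSE
}
```

With this, the abort branch becomes if (!has_content(x)) rlang::abort(...), and any input whose first element is non-blank pays almost nothing for the check.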
Change gibasa to match this.
tidytext::unnest_tokens also keeps columns other than doc_id and text, so it might be good to match that behavior (though arguably users could just join those columns back themselves afterwards):
df <- data.frame(doc_id = c(1:3), text = audubon::polano[3:5], meta = c(4:6))
tidytext::unnest_tokens(df, token, text)
#> doc_id meta token
#> 1 1 4 前
#> 2 1 4 十七
#> 3 1 4 等
#> 4 1 4 官
#> 5 1 4 レ
#> 6 1 4 オー
#> 7 1 4 ノ
#> 8 1 4 キュー
#> 9 1 4 スト
#> 10 1 4 誌
#> 11 2 5 宮沢
#> 12 2 5 賢治
#> 13 2 5 訳述
#> 14 3 6 その
#> 15 3 6 ころ
#> 16 3 6 わたくし
#> 17 3 6 は
#> 18 3 6 モリーオ
#> 19 3 6 市
#> 20 3 6 の
#> 21 3 6 博物
#> 22 3 6 局
#> 23 3 6 に
#> 24 3 6 勤め
#> 25 3 6 て
#> 26 3 6 居
#> 27 3 6 り
#> 28 3 6 ま
#> 29 3 6 した
Created on 2022-04-10 by the reprex package (v2.0.1)
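As the parenthetical note above suggests, callers can already recover the extra columns with a join; a base-R sketch using a toy whitespace tokenizer in place of gibasa (the tokenizer and frame are made up for illustration):

```r
df <- data.frame(doc_id = 1:3, text = c("a b", "c", "d e"), meta = 4:6)

# Toy tokenizer that, like the current gibasa behavior, keeps only
# doc_id and the tokens, dropping all other columns
tokens <- data.frame(
  doc_id = rep(df$doc_id, lengths(strsplit(df$text, " "))),
  token  = unlist(strsplit(df$text, " "))
)

# Join the dropped metadata back by doc_id
res <- merge(tokens, df[, c("doc_id", "meta")], by = "doc_id")
```

Carrying the columns through inside the tokenizer would save users this step, at the cost of copying metadata once per token row.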
According to Wikipedia, probabilistic IDF can be calculated as:

idf(t) = log2((N - n_t) / n_t)

where N is the total number of documents and n_t is the number of documents in which token t appears.
To implement this formula, gibasa:::global_idf3 looks like it should be written as:
global_idf3 <- function(sp) {
df <- gibasa:::count_nnzero(sp)
purrr::set_names(log2((nrow(sp) - df) / df), colnames(sp))
}
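A quick base-R check of this formula on a toy matrix (a plain matrix stands in for the sparse document-term matrix, and colSums(sp > 0) is assumed to match gibasa:::count_nnzero):

```r
# Toy document-term matrix: term "a" appears in 3 of 4 docs, "b" in 1 of 4
sp <- matrix(
  c(1, 0,
    2, 0,
    0, 1,
    1, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("d", 1:4), c("a", "b"))
)

# n_t: number of documents containing each term
df <- colSums(sp > 0)

# Probabilistic IDF: log2((N - n_t) / n_t)
idf3 <- log2((nrow(sp) - df) / df)
```

Note the weight goes negative for terms appearing in more than half the documents (here "a"), which is expected for probabilistic IDF.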
With some effort, it should be possible to eliminate this warning. Rewrite the following:
❯ checking compiled code ... WARNING
File 'gibasa/libs/x64/gibasa.dll':
Found '_ZSt4cerr', possibly from 'std::cerr' (C++)
Objects: 'char_property.o', 'connector.o', 'context_id.o',
'dictionary.o', 'dictionary_rewriter.o', 'eval.o',
'feature_index.o', 'iconv_utils.o', 'lbfgs.o', 'learner.o',
'learner_tagger.o', 'tagger.o', 'tokenizer.o', 'utils.o'
Found '_ZSt4cout', possibly from 'std::cout' (C++)
Objects: 'connector.o', 'dictionary.o', 'eval.o',
'feature_index.o', 'learner.o', 'learner_tagger.o', 'param.o',
'tagger.o', 'viterbi.o'
Found 'exit', possibly from 'exit' (C), 'stop' (Fortran)
Objects: 'char_property.o', 'connector.o', 'context_id.o',
'dictionary.o', 'dictionary_rewriter.o', 'eval.o',
'feature_index.o', 'learner.o', 'learner_tagger.o', 'tagger.o',
'tokenizer.o', 'utils.o'
Compiled code should not call entry points which might terminate R nor
write to stdout/stderr instead of to the console, nor use Fortran I/O
nor system RNGs.
See 'Writing portable packages' in the 'Writing R Extensions' manual.