ropensci / cld3

Bindings to Google's Compact Language Detector 3

Home Page: https://docs.ropensci.org/cld3

R 0.13% Shell 0.01% C++ 99.86%

cld cld3 language-detection language-detector r rstats r-package

cld3's Introduction

cld3

R Wrapper for Google's Compact Language Detector 3

Google's Compact Language Detector 3 is a neural network model for language identification and the successor of CLD2 (available from CRAN). This version is still experimental and uses a novel algorithm with different properties and outcomes. For more information see: https://github.com/google/cld3#readme

Example

The function detect_language() is vectorised and guesses the language of each string in text, or returns NA if the language could not be reliably determined.

> library(cld3)
> example(cld3)

cld3> # Vectorized best guess
cld3> detect_language(c("To be or not to be?", "Ce n'est pas grave.", "猿も木から落ちる"))
[1] "en" "fr" "ja"

The function detect_language_mixed() is not vectorised and detects all languages inside the entire character vector as a whole.

cld3> # Multiple languages in one text
cld3> detect_language_mixed("This piece of text is in English. Този текст е на Български.", size = 3)
  language probability reliable proportion
1       bg   0.9173891     TRUE  0.5853658
2       en   0.9999790     TRUE  0.4146341
3      und   0.0000000    FALSE  0.0000000
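Because detect_language_mixed() reads its input as one text, getting a separate result per string needs an explicit loop. The following is a minimal sketch, assuming the cld3 package is installed; the helper name detect_mixed_each and its detector argument are hypothetical and not part of cld3. The detector argument exists only so the helper can be tried with a stand-in function.

```r
# Hypothetical helper (not part of cld3): run a whole-text detector
# on each string separately.
# `detector` defaults to cld3::detect_language_mixed, which needs cld3 installed.
detect_mixed_each <- function(texts, size = 3,
                              detector = cld3::detect_language_mixed) {
  stopifnot(is.character(texts))
  # One result per input string, kept in input order
  lapply(texts, function(x) detector(x, size = size))
}

# Usage, assuming cld3 is installed:
# detect_mixed_each(c("This piece of text is in English.",
#                     "Този текст е на Български."), size = 1)
```

The result is a list with one data frame per input string, so the per-string tables can be inspected or combined afterwards.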

Installation

Binary packages for macOS or Windows can be installed directly from CRAN:

install.packages("cld3")

Installation from source on Linux or macOS requires Google's Protocol Buffers library. On Debian or Ubuntu install libprotobuf-dev and protobuf-compiler:

sudo apt-get install -y libprotobuf-dev protobuf-compiler

On Fedora we need protobuf-devel:

sudo yum install protobuf-devel

On CentOS / RHEL we install [protobuf-devel](https://src.fedoraproject.org/rpms/protobuf) via EPEL:

sudo yum install epel-release
sudo yum install protobuf-devel

On macOS use protobuf from Homebrew:

brew install protobuf

cld3's Issues

detect_language_mixed(): R Session Crashing when running on empty entries

Hey!

I have a large dataset of mixed-language entries (assume 100k+) that I want to run cld3's language detection on in order to detect non-English snippets. However, I ran into a problem: the R session aborts (fatal error) as soon as I run it over certain entries. I isolated the problem, and it seems that as soon as it hits an empty entry (""), it fails and takes the whole session down with it. cld2::detect_language_mixed and cld3::detect_language() both do not seem to have that issue, so I'm assuming it would be an easy fix to escape these entries and return NA. Since it took me a while to figure out, implementing this in the next update might save others quite a bit of heartache. I'm running the latest cld3 release from CRAN (1.4.1).

Also, thanks for the great package! It's really helpful seeing that it seems to deal better with multi-language entries than cld2.
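Until such a check exists in the package, one possible workaround is to test for empty input before calling the detector. This is only a sketch under stated assumptions: the wrapper name safe_detect_mixed and its detector argument are hypothetical, and the NA row simply mirrors the four columns shown in the README example output.

```r
# Hypothetical wrapper (not part of cld3): return an NA row instead of
# calling the detector when the input holds no usable text.
# `detector` defaults to cld3::detect_language_mixed, which needs cld3 installed.
safe_detect_mixed <- function(text, size = 3,
                              detector = cld3::detect_language_mixed) {
  # TRUE when every element is NA, empty, or whitespace only
  is_blank <- all(is.na(text) | !nzchar(trimws(text)))
  if (is_blank) {
    return(data.frame(language = NA_character_, probability = NA_real_,
                      reliable = NA, proportion = NA_real_))
  }
  detector(text, size = size)
}

# Usage, assuming cld3 is installed:
# results <- lapply(c("Some text", ""), safe_detect_mixed)
```

Returning a data frame with the same columns keeps the results easy to combine later, for example with do.call(rbind, results).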

Spanish manual language detection problems

Hi @jeroen!
Thank you so much for this development, it runs so smooth and it's so useful!
I have been doing some manual tagging for Spanish tags and have found some things that might be interesting but I am unsure if this would be useful for this wrapper and was wondering if you could point me towards the right direction.
For instance, from a list of conference titles, those in "Spanglish" got tagged as English with cld2 and as Spanish with cld3. Also, while cld3 got much better at distinguishing Galician from Spanish, there are more than a few times where it gets these tags wrong.
Can you think of someone who could benefit from my manually tagged dataset?

Can't install on Ubuntu 20

Installation results in a screed of errors. However it seems protobuf is already the newest version so I have no idea what's wrong.

sudo apt-get install -y libprotobuf-dev protobuf-compiler
[sudo] password for ben: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libprotobuf-dev is already the newest version (3.12.4-1ubuntu2).
protobuf-compiler is already the newest version (3.12.4-1ubuntu2).

The error is about 1000 lines of this:

In file included from libcld3/feature_extractor.h:45,
                 from libcld3/embedding_feature_extractor.h:23,
                 from libcld3/nnet_language_identifier.h:22,
                 from wrapper.cpp:2:
./cld_3/protos/feature_extractor.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/feature_extractor.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/feature_extractor.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
In file included from libcld3/task_context.h:23,
                 from libcld3/feature_extractor.h:49,
                 from libcld3/embedding_feature_extractor.h:23,
                 from libcld3/nnet_language_identifier.h:22,
                 from wrapper.cpp:2:
./cld_3/protos/task_spec.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/task_spec.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/task_spec.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
In file included from libcld3/language_identifier_features.h:24,
                 from libcld3/nnet_language_identifier.h:25,
                 from wrapper.cpp:2:
./cld_3/protos/sentence.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/sentence.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/sentence.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
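The error text itself says the generated .pb.h files come from a newer protoc than the Protocol Buffers headers being compiled against. One way to look for such a mismatch is to compare the version reported by the protoc on the PATH with the library version reported by pkg-config. The sketch below assumes both tools are installed and on the PATH; the function names are hypothetical, and it is a diagnostic aid rather than a confirmed fix for this issue.

```r
# Extract the first dotted version number from a line of tool output,
# e.g. "libprotoc 3.12.4" gives version 3.12.4
parse_version <- function(line) {
  m <- regmatches(line, regexpr("[0-9]+(\\.[0-9]+)+", line))
  if (length(m) == 0) stop("no version number found in: ", line)
  numeric_version(m)
}

# TRUE when the compiler version is newer than the header (library) version
protoc_newer_than_headers <- function(protoc_line, headers_line) {
  parse_version(protoc_line) > parse_version(headers_line)
}

# Usage, assuming `protoc` and `pkg-config` are on the PATH:
# protoc_line  <- system2("protoc", "--version", stdout = TRUE)
# headers_line <- system2("pkg-config", c("--modversion", "protobuf"), stdout = TRUE)
# protoc_newer_than_headers(protoc_line, headers_line)
```

If the check returns TRUE, a second protoc installation earlier on the PATH than the distribution's package would be one possible explanation.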

Issue about incorrect detection

I tried many Korean texts, but CLD3 is unable to detect them correctly,
using the command: cld3::detect_language(text)

For example:
Korean text: "이 회의에서는 업계 전반의" => output: vi

English text: "hello world" => output: ky

How can CLD3 detect the language more accurately?

Thank you very much.
