
cld3's Introduction

cld3

R Wrapper for Google's Compact Language Detector 3

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Google's Compact Language Detector 3 is a neural network model for language identification and the successor of CLD2 (available from CRAN). This version is still experimental and uses a novel algorithm with different properties and outcomes. For more information see: https://github.com/google/cld3#readme

Example

The function detect_language() is vectorised and guesses the language of each string in text, or returns NA if the language could not be reliably determined.

> library(cld3)
> example(cld3)

cld3> # Vectorized best guess
cld3> detect_language(c("To be or not to be?", "Ce n'est pas grave.", "猿も木から落ちる"))
[1] "en" "fr" "ja"

The function detect_language_mixed() is not vectorised and detects all languages inside the entire character vector as a whole.

cld3> # Multiple languages in one text
cld3> detect_language_mixed("This piece of text is in English. Този текст е на Български.", size = 3)
  language probability reliable proportion
1       bg   0.9173891     TRUE  0.5853658
2       en   0.9999790     TRUE  0.4146341
3      und   0.0000000    FALSE  0.0000000
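
As a small illustration of working with this result, the sketch below keeps only the rows flagged as reliable. The data frame is rebuilt by hand from the output printed above so that the snippet runs without cld3 installed; in practice it would come from detect_language_mixed(). The names res and detected are only illustrative.

```r
# Values copied from the example output above; normally this data frame
# would be the return value of cld3::detect_language_mixed().
res <- data.frame(
  language    = c("bg", "en", "und"),
  probability = c(0.9173891, 0.9999790, 0.0000000),
  reliable    = c(TRUE, TRUE, FALSE),
  proportion  = c(0.5853658, 0.4146341, 0.0000000),
  stringsAsFactors = FALSE
)

# Keep only rows flagged reliable; here that also drops the "und" row.
detected <- res[res$reliable, ]
print(detected$language)
```

Filtering on the reliable column is one possible choice; a threshold on probability or proportion could be used instead, depending on the data.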

Installation

Binary packages for macOS or Windows can be installed directly from CRAN:

install.packages("cld3")

Installation from source on Linux or macOS requires Google's Protocol Buffers library. On Debian or Ubuntu install libprotobuf-dev and protobuf-compiler:

sudo apt-get install -y libprotobuf-dev protobuf-compiler

On Fedora we need protobuf-devel:

sudo yum install protobuf-devel

On CentOS / RHEL we install protobuf-devel (https://src.fedoraproject.org/rpms/protobuf) via EPEL:

sudo yum install epel-release
sudo yum install protobuf-devel

On macOS use protobuf from Homebrew:

brew install protobuf

cld3's People

Contributors

jeroen, maelle

cld3's Issues

detect_language_mixed(): R Session Crashing when running on empty entries

Hey!

I have a large dataset of mixed-language entries (assume 100k+) that I want to run cld3's language detection on in order to detect non-English snippets. However, the R session aborts (fatal error) as soon as I run it over certain entries. I isolated the problem: as soon as detect_language_mixed() hits an empty entry (""), it fails and takes the whole session down with it. cld2::detect_language_mixed() and cld3::detect_language() both do not seem to have that issue, so I'm assuming it would be an easy fix to escape these entries and return NA. Since it took me a while to figure out, implementing this in the next update might save others quite a bit of heartache. I'm running the latest cld3 release from CRAN (1.4.1).

Also, thanks for the great package! It's really helpful seeing that it seems to deal better with multi-language entries than cld2.
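
Until such a fix lands, the guard described above could be applied on the caller's side. The following is a minimal sketch, not part of cld3: detect_mixed_safely() is a hypothetical helper name, the detector argument exists only so the function can be tried with a stand-in, and treating whitespace-only entries like empty ones is a conservative assumption (only "" is reported to crash).

```r
# Hypothetical helper (not part of cld3): keep NA, empty, and whitespace-only
# entries away from the detector and return NA for them instead.
detect_mixed_safely <- function(text, size = 3,
                                detector = cld3::detect_language_mixed) {
  text <- as.character(text)
  is_blank <- is.na(text) | !nzchar(trimws(text))

  results <- vector("list", length(text))
  # Blank entries get a plain NA and never reach the native code.
  results[is_blank] <- list(NA)
  # detect_language_mixed() treats its input as one text, so call it per entry.
  results[!is_blank] <- lapply(text[!is_blank], detector, size = size)
  results
}
```

With cld3 installed, detect_mixed_safely(c("This is English.", "")) would be expected to return a list holding one data frame and one NA. Because the default for detector is only evaluated when it is actually used, passing a stand-in function lets the helper be tested without the package.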

Spanish manual language detection problems

Hi @jeroen!
Thank you so much for this development, it runs so smoothly and it's so useful!
I have been manually tagging Spanish text and have found some things that might be interesting, but I am unsure whether they would be useful for this wrapper, and was wondering if you could point me in the right direction.
For instance, in a list of conference titles, those in "Spanglish" got tagged as English with cld2 and as Spanish with cld3. Also, while cld3 is much better at distinguishing Galician from Spanish, it still gets these tags wrong more than a few times.
Can you think of someone who could benefit from my manually tagged dataset?

Can't install on Ubuntu 20

Installation results in a screed of errors. However it seems protobuf is already the newest version so I have no idea what's wrong.

sudo apt-get install -y libprotobuf-dev protobuf-compiler
[sudo] password for ben: 
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
libprotobuf-dev is already the newest version (3.12.4-1ubuntu2).
protobuf-compiler is already the newest version (3.12.4-1ubuntu2). 

The error is about 1000 lines like this:

In file included from libcld3/feature_extractor.h:45,
                 from libcld3/embedding_feature_extractor.h:23,
                 from libcld3/nnet_language_identifier.h:22,
                 from wrapper.cpp:2:
./cld_3/protos/feature_extractor.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/feature_extractor.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/feature_extractor.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
In file included from libcld3/task_context.h:23,
                 from libcld3/feature_extractor.h:49,
                 from libcld3/embedding_feature_extractor.h:23,
                 from libcld3/nnet_language_identifier.h:22,
                 from wrapper.cpp:2:
./cld_3/protos/task_spec.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/task_spec.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/task_spec.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
In file included from libcld3/language_identifier_features.h:24,
                 from libcld3/nnet_language_identifier.h:25,
                 from wrapper.cpp:2:
./cld_3/protos/sentence.pb.h:12:2: error: #error This file was generated by a newer version of protoc which is
   12 | #error This file was generated by a newer version of protoc which is
      |  ^~~~~
./cld_3/protos/sentence.pb.h:13:2: error: #error incompatible with your Protocol Buffer headers. Please update
   13 | #error incompatible with your Protocol Buffer headers. Please update
      |  ^~~~~
./cld_3/protos/sentence.pb.h:14:2: error: #error your headers.
   14 | #error your headers.
      |  ^~~~~
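
The #error lines state that the .pb.h files under cld_3/protos were generated by a protoc that is newer than the Protocol Buffers headers the compiler found. One thing that may be worth checking, as a guess rather than a confirmed diagnosis, is whether the protoc that comes first on the PATH is the same version as the installed headers. The sketch below uses a hypothetical helper, protobuf_versions(), and assumes that pkg-config is installed and knows about protobuf; anything it cannot determine is reported as NA.

```r
# Hypothetical diagnostic helper: report which protoc is first on the PATH
# and which protobuf version pkg-config advertises for the headers.
protobuf_versions <- function() {
  first_line <- function(cmd, args) {
    path <- unname(Sys.which(cmd))
    if (is.na(path) || !nzchar(path)) return(NA_character_)
    out <- tryCatch(
      suppressWarnings(system2(path, args, stdout = TRUE, stderr = TRUE)),
      error = function(e) character(0)
    )
    # A "status" attribute means the command exited with a non-zero code.
    if (length(out) == 0 || !is.null(attr(out, "status"))) return(NA_character_)
    out[[1]]
  }
  list(
    protoc_path     = unname(Sys.which("protoc")),
    protoc_version  = first_line("protoc", "--version"),
    headers_version = first_line("pkg-config", c("--modversion", "protobuf"))
  )
}

print(protobuf_versions())
```

If the two versions differ, or protoc_path points somewhere other than the system package location, a second protobuf installation may be shadowing the one installed through apt.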

Issue about incorrect detection

I tried many Korean texts, but CLD3 is unable to detect the language correctly.
The command used was: cld3::detect_language(text)

For example:
Korean text: "이 회의에서는 업계 전반의" => output: vi

English text: "hello world" => output: ky

How can CLD3 detect the language more accurately?

Thank you very much.
