cld2's Introduction

cld2

R Wrapper for Google's Compact Language Detector 2

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

CLD2 probabilistically detects over 80 languages in Unicode UTF-8 text, either plain text or HTML/XML. For mixed-language input, CLD2 returns the top three languages found and their approximate percentages of the total text bytes (e.g. 80% English and 20% French out of 1000 bytes).

Installation

This package includes a bundled version of libcld2:

devtools::install_github("ropensci/cld2")

Guess a Language

The function detect_language() returns the best guess, or NA if the language could not reliably be determined.

cld2::detect_language("To be or not to be")
# [1] "ENGLISH"

cld2::detect_language("Ce n'est pas grave.")
# [1] "FRENCH"

cld2::detect_language("Nou breekt mijn klomp!")
# [1] "DUTCH"
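
Because detect_language() returns NA when no reliable guess is available, it can help to filter those cases out before further processing. The following is a minimal sketch, assuming the function accepts a character vector and returns one guess per element; the variable names texts, langs and detected are illustrative only.

```r
# Sample inputs taken from the examples above, plus an empty string,
# which is unlikely to yield a reliable guess.
texts <- c(
  "To be or not to be",
  "Ce n'est pas grave.",
  "Nou breekt mijn klomp!",
  ""
)

# Assumption: one guess is returned per element of `texts`.
langs <- cld2::detect_language(texts)

# Keep only the inputs for which a language was reported (drop NA guesses).
detected <- data.frame(text = texts, language = langs, stringsAsFactors = FALSE)
detected <- detected[!is.na(detected$language), ]
print(detected)
```

Dropping NA rows early keeps later steps, such as counting texts per language, from treating undetermined inputs as a category of their own.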

Set plain_text = FALSE if your input contains HTML:

cld2::detect_language(url('http://www.un.org/ar/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "ARABIC"

cld2::detect_language(url('http://www.un.org/zh/universal-declaration-human-rights/'), plain_text = FALSE)
# [1] "CHINESE"

Use detect_language_multi() to get detailed classification output.

cld2::detect_language_multi(url('http://www.un.org/fr/universal-declaration-human-rights/'), plain_text = FALSE)
# $classification
#   language code latin proportion
# 1   FRENCH   fr  TRUE       0.96
# 2  ENGLISH   en  TRUE       0.03
# 3   ARABIC   ar FALSE       0.00
# 
# $bytes
# [1] 17008
# 
# $reliabale
# [1] TRUE

This shows the top three language guesses and the proportion of text that was classified as each language. The bytes attribute shows the total number of text bytes that were classified, and reliable (printed as reliabale in the output above) is the result of a complex calculation on whether the top language is some amount more probable than the second-best language.
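
To make the relationship between proportion and bytes concrete, here is a small sketch in base R. It does not call the package; it rebuilds the list printed above by hand, so the values are copied from that output. The helper top_language() and the column approx_bytes are hypothetical names introduced for illustration.

```r
# Mock of the list printed above; values are copied from that output.
# The element name `reliabale` is spelled as it appears in the printed result.
res <- list(
  classification = data.frame(
    language = c("FRENCH", "ENGLISH", "ARABIC"),
    code = c("fr", "en", "ar"),
    latin = c(TRUE, TRUE, FALSE),
    proportion = c(0.96, 0.03, 0.00),
    stringsAsFactors = FALSE
  ),
  bytes = 17008L,
  reliabale = TRUE
)

# Approximate number of text bytes attributed to each language.
res$classification$approx_bytes <-
  round(res$classification$proportion * res$bytes)

# Hypothetical helper: accept the top guess only when it is flagged reliable.
top_language <- function(x) {
  if (!isTRUE(x$reliabale)) return(NA_character_)
  top <- x$classification[which.max(x$classification$proportion), ]
  top$language
}

print(res$classification)
top_language(res)
```

With the values above, roughly 16328 of the 17008 classified bytes are attributed to French, and top_language(res) returns "FRENCH" because the result is flagged as reliable.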

cld2's People

Contributors

danbencol, jeroen, maelle

cld2's Issues

gl gets confused with es 100% of the time

Hi there! Thank you so much for implementing this wrapper! It works so nicely in R. I was just wondering if there was anything we could do to improve its accuracy, especially with languages that are very close to each other (for instance Galician -gl- and Spanish -es-).

I ran the detect_language function on around 2,000 titles and it "detected" around 60 as Galician and they were all in Spanish. Here's a sample of those misidentified titles:

  1. Los albores de la sinología occidental: la contribución de Fernández de Navarrete
  2. Herencia y cuidado: transiciones en la obligación filial
  3. Análisis de la tendencia e impacto de la mortalidad por causas externas: México, 2000-2013
  4. Significado, relevancia y elementos de género asociados al cuidado: metasíntesis cualitativa
  5. Algunas reflexiones metodológicas al abordar experiencias reproductivas de los varones desde las políticas públicas
  6. Bienestar financiero, una reflexión desde la ficción neoliberal en un contexto local
  7. Orientaciones para las políticas públicas de juventud: una revisión documental
  8. Muros y migración México-Estados Unidos
  9. Brecha salarial por género en México: Desde un enfoque regional, según su exposición a la apertura comercial 2005-2015
  10. Las Salvaguardias y su impacto en sector comercial de Ecuador

I was curious if you might know how the algorithm works and if there was any way to improve it :)

Different results from pycld2 binding

Hi pycld2 team!

Just wanted to let you know that @quinnanya used the pycld2 binding to detect language (see her csv) on the same Conference Titles for which I used your Rstats wrapper, and funnily enough we got different results, any idea why?
Using the Rstats wrapper I get more "french" tags:
32 of those "french" tags were marked as NA with Python, 8 as un, and 3 as en.
I manually checked those 3 pycld2 English tags that had a "fr" tag with the Rstats wrapper: one is indeed French, another is Frenglish, and the last one is English.

Segfault on some URLs

I'm not sure if this is a problem with the R package or the underlying library.

For some URLs, detect_language (and detect_language_multi) consistently cause a segfault.

> detect_language( url( "http://thehill.com" ), plain_text=FALSE )

 *** caught segfault ***
address 0x7f4dc86b66a4, cause 'invalid permissions'

Traceback:
 1: detect_language_cc(as_string(text), plain_text, lang_code)
 2: detect_language(url("http://thehill.com"), plain_text = FALSE)

I also get this error on http://qq.com.

Platform details:
R version 3.5.0 (2018-04-23) -- "Joy in Playing"
Copyright (C) 2018 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

Running under Arch Linux, and the most recent cld2 installed via devtools.
