guess-language's Issues

Exception for Unicode chars > 0xFFFF

What steps will reproduce the problem?
Unicode characters outside the Basic Multilingual Plane (ord(c) > 0xFFFF) raise an IndexError.

Traceback (most recent call last):
  File "describe-channels.py", line 20, in <module>
    lang = guess_language.guessLanguage(" ".join(row.get('text', [])))
  File "/usr/local/lib/python2.6/dist-packages/guess_language/guess_language.py", line 300, in guessLanguage
    return _identify(text, find_runs(text))
  File "/usr/local/lib/python2.6/dist-packages/guess_language/guess_language.py", line 352, in find_runs
    block = unicodeBlock(c)
  File "/usr/local/lib/python2.6/dist-packages/guess_language/blocks.py", line 64, in unicodeBlock
    return _names[ix]
IndexError: list index out of range


Original issue reported on code.google.com by [email protected] on 17 Sep 2012 at 8:02
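
The traceback suggests that the block lookup indexes past the end of its name table for large code points. A minimal sketch of a range-checked lookup follows. It is not the library's code: the table `_BLOCKS` is a hypothetical, heavily abbreviated stand-in, and `unicode_block` is an illustrative name. The point is only that a character with no matching range yields None instead of an exception.

```python
import bisect

# Hypothetical, heavily abbreviated block table:
# (first code point, last code point, block name), sorted by first code point.
_BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _end, _name in _BLOCKS]


def unicode_block(char):
    """Return the block name for one character, or None if no range covers it."""
    code_point = ord(char)
    # Index of the last block whose first code point is <= code_point.
    index = bisect.bisect_right(_STARTS, code_point) - 1
    if index < 0:
        return None
    start, end, name = _BLOCKS[index]
    # Explicit range check: gaps between blocks and code points past the
    # last block fall through to None rather than raising IndexError.
    return name if start <= code_point <= end else None
```

A caller such as find_runs could then treat None as "unknown script" and skip the character.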

fails to detect Japanese sometimes; warning about effect of markup in the input

I'd like to draw your attention to
http://stackoverflow.com/questions/2164899/am-i-passing-the-string-correctly-to-the-python-library

Two suggestions:

(1) When detecting Japanese language, the code should ADD the scores for
Katakana and Hiragana and compare the TOTAL against a threshold.

(2) Point out in the documentation that not stripping
HTML/XML/Javascript/whatever out of the input can result in the answer
being heavily biased towards an ASCII-only-script language (typically English).

Original issue reported on code.google.com by [email protected] on 5 Feb 2010 at 11:48
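
Both suggestions can be illustrated with a small standalone sketch. It does not use the library's internals: `strip_tags`, `kana_shares`, `looks_japanese`, and the 0.5 threshold are all hypothetical, and the tag regex is deliberately crude (it does not handle script bodies or comments).

```python
import re

# Crude tag remover, for illustration only.
_TAG_RE = re.compile(r"<[^>]+>")


def strip_tags(text):
    """Replace anything that looks like an HTML/XML tag with a space."""
    return _TAG_RE.sub(" ", text)


def kana_shares(text):
    """Return the fraction of alphabetic characters that are Hiragana or Katakana."""
    letters = hiragana = katakana = 0
    for ch in text:
        if not ch.isalpha():
            continue
        letters += 1
        code_point = ord(ch)
        if 0x3040 <= code_point <= 0x309F:
            hiragana += 1
        elif 0x30A0 <= code_point <= 0x30FF:
            katakana += 1
    if letters == 0:
        return {"Hiragana": 0.0, "Katakana": 0.0}
    return {"Hiragana": hiragana / letters, "Katakana": katakana / letters}


def looks_japanese(text, threshold=0.5):
    """Compare the SUM of both kana shares against the (hypothetical) threshold."""
    shares = kana_shares(text)
    return shares["Hiragana"] + shares["Katakana"] >= threshold
```

In a sentence such as "東京でコーヒーを飲む", neither kana script reaches half of the letters on its own, but together they do. Wrapping the same sentence in a div tag without stripping it first adds enough ASCII letters to push the combined share below the threshold.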

guess_language sort of kills multiprocess

What steps will reproduce the problem?
1. Create 20 processes using multiprocessing
2. In each of them: from guess_language import guessLanguage
3. Make them all run at the same time under heavy load

What is the expected output? What do you see instead?

processes died around me like flies

What version of the product are you using? On what operating system?
Windows Vista 64-bit, Python 2.6.6

Please provide any additional information below.

Workaround
I made the problem disappear by using:

from guess_language import * 

instead


Original issue reported on code.google.com by [email protected] on 16 Oct 2010 at 1:14

NLTK integration?

How about integrating this package with the Natural Language Toolkit? 
http://nltk.googlecode.com/

Original issue reported on code.google.com by StevenBird1 on 23 Feb 2009 at 5:03

Returns Portuguese instead of Spanish for simple sentences

What steps will reproduce the problem?
1. Remove the size restrictions from the source code
2. Do: guess_language.guessLanguageName("hola como estas")
3. It returns "portuguese" instead of "spanish". In Portuguese,
"hola como estas" would be "Olá, como vai você".

What is the expected output? What do you see instead?
Expected Spanish; I see Portuguese instead.

What version of the product are you using? On what operating system?
Latest from SVN.

Please provide any additional information below.


Original issue reported on code.google.com by [email protected] on 26 Jul 2010 at 4:11

missing setup.py

It would be nice if a) guess-language had a setup.py so one could make an
egg out of it and b) there was an egg uploaded to PyPI ('python setup.py
register sdist upload').

I've attached a minimal setup.py.

Original issue reported on code.google.com by wolfgang.schnerring on 4 May 2010 at 10:16

Extremely slow on large files

What steps will reproduce the problem?
1. Read in a large file of varied Unicode characters
2. Run guessLanguage on it
3. It takes forever

What is the expected output? What do you see instead?


What version of the product are you using? On what operating system?


Please provide any additional information below.
The library is designed to deal with small chunks of data, which is fine. 
However, in the case you feed it lots of data, it slows to a crawl.

This appears to be because of the nonAlphaRe call in normalize; the regex is 
thousands of characters long, and applied to every character in the data. 

A substantial speedup (100x or more) can be obtained by replacing the following 
call in normalize():
    u = nonAlphaRe.sub(' ', u)
with
    u = ''.join(c if c.isalpha() else ' ' for c in u)
which I believe has the same effect.

Original issue reported on code.google.com by [email protected] on 13 Jul 2011 at 1:18
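
A standalone sketch of the proposed replacement is given here so that its behaviour can be checked in isolation. The function name `replace_non_alpha` is hypothetical, and the sketch does not claim to match the library's nonAlphaRe exactly: str.isalpha() is true only for letter categories, so combining marks (for example Devanagari vowel signs) are replaced by spaces too, which may or may not be what the original regex did.

```python
def replace_non_alpha(text):
    """Replace every character for which str.isalpha() is False with a space."""
    # One pass over the input; no large regex alternation is evaluated.
    return "".join(ch if ch.isalpha() else " " for ch in text)


if __name__ == "__main__":
    # Punctuation, digits and spaces all become single spaces.
    print(repr(replace_non_alpha("héllo, wörld 42!")))
```

If combining marks need to be preserved, the condition could be extended with a check on unicodedata.category(ch), at some cost in speed.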

Language detection does not work for Russian

What steps will reproduce the problem?

CODE:
user_text = u"привет Мир!"
language = guessLanguage(user_text)
print language # expected: ru, actual: UNKNOWN

Original issue reported on code.google.com by anton.danilchenko on 16 Jun 2012 at 10:32

Trigrams mismatch or naming problem

Hello, and hats off for the nice port to Python!

I have a problem regarding the trigrams (or their naming conventions). I
just had a text in Vietnamese from the feed below and it gets detected as
"ha", "tl" and a few others.

However, the online Perl version gives me 100% Vietnamese, and when using the
trigrams from the Perl package the result is also correct. Could it be that
something was overwritten when the names were shortened?

Cheers, 
Martin

Feed link: http://www.tongti.net/?feed=rss2


Original issue reported on code.google.com by [email protected] on 24 Aug 2008 at 8:51

Guessing the language of this text causes a segfault

"""
>>> print text
सीवीसी द्वारा मल्टीनेशनल  पर 
जांच के आदेश

केन्द्रीय सतर्कता आयोग ने 
कोर्पोरेट भ्रष्टाचार के एक 
प्रकरण में डॉ आर एस
शुक्ल, संयुक्त सचिव एवं मुख्य 
सतर्कता अधिकारी, स्वास्थ्य 
एवं परिवार कल्याण,
भारत सरकार को जांच करने के 
आदेश दिए हैं. यह मामला स्विस 
मल्टीनेशनल
वेस्टरगार्द फ्रैंडसेन ग्रुप 
एसए के भारत स्थित सब्सिडीअरी 
वेस्टरगार्द
फ्रैंडसेन इंडिया लिमिटेड से 
सम्बंधित है. गवर्नेंस में 
पारदर्शिता के क्षेत्र
में कार्यरत लखनऊ स्थित संस्था 
नेशनल आरटीआई फोरम द्वारा 
केन्द्रीय सतर्कता
आयोग को  में पत्र लिखकर जांच और 
कार्यवाही की मांग की गई थी.

संस्था की कन्वेनर डॉ नूतन 
ठाकुर के मुताबिक, वेस्टरगार्द 
फ्रैंडसेन इंडिया
द्वारा भारत में विभिन्न 
राज्यों को काफी अलग-अलग दरों 
पर “लंबी-आयु कीटनाशक
मच्छरदानी” (एलएलआइएन) प्रदान 
किया गया है. एलएलआइएन का उपयोग 
आर्द्रतायुक्त,
वन क्षेत्रों में होता है जहाँ 
मच्छरों का भयावह प्रकोप होता 
है. भारत के बहुत
सारे राज्य सरकारों और तमाम 
अन्य सरकारी संस्थाओं द्वारा 
गरीब लोगों को
निशुल्क वितरित करने के लिए 
एलएलआइएन  ख़रीदे जाते हैं. 
वेस्टरगार्द फ्रैंडसेन
भारत एवं विश्व के कई देशों में 
एलएलआइएन के सबसे बड़े 
सप्लायरों में है.
यह कंपनी एलएलआइएन की सप्लाई 
या तो सीधे करती है या अपने 
मध्यस्थों के जरिये.
लेकिन इसके द्वारा सप्लाई किये 
गए एलएलआइएन की कीमतों में 
अलग-अलग जगह पर बहुत
भारी अंतर होता है जिससे यह 
साफ़ झलकता है कि इनके रेट तय 
करने में बेईमानी हुई
है. साथ ही इन दरों को तय करने 
में आवश्यक प्रक्रिया का पालन 
भी नहीं किया गया
है.

डॉ ठाकुर के अनुसार संस्था के 
पास उपलब्ध दस्तावेजों के 
अनुसार इनके दर रु०
199 प्रति ईकाई से रु० 400 तक नियत 
किये गए. यह पूरी तरह गलत है और 
दर्शाता है
कि किस प्रकार अत्यंत गरीब 
लोगों को मच्छरदानी वितरित 
करने के नाम पर
भ्रष्टाचार किया जा रहा है. 
ज्ञातव्य हो कि असम, जहाँ 
वेस्टरगार्द फ्रैंडसेन
द्वारा अपने मध्यस्थ मेसर्स 
ग्लोबल बिजिनेस प्राइवेट 
लिमिटेड, नयी दिल्ली के
माध्यम से रु० 400 प्रति 
मच्छरदानी के दर से एलएलआइएन 
सप्लाई किये गए हैं, के
मामले में उनके द्वारा 
कॉन्ट्रेक्ट पाने के पूर्व ही 
रुपये 295 प्रति ईकाई की
दर से वेस्टरगार्द फ्रैंडसेन 
से उतनी ही मच्छरदानी की खरीद 
का समझौता कर लिया
गया था, जो साफ़ दर्शाता है कि 
इसमें असम सरकार के 
अधिकारियों, वेस्टरगार्द
फ्रैंडसेन और ग्लोबल प्राइवेट 
की मिलीभगत थी.

केन्द्रीय सतर्कता आयोग ने डॉ 
शुक्ल को तीन महीने में जांच कर 
रिपोर्ट देने के
आदेश दिए हैं.







The Central Vigilance Commission has handed over the investigation of
alleged high level corporate corruption to Dr R S Shukla, Joint Secretary
and Chief Vigilance Officer of the Ministry of Health and Family Welfare,
Government of India. The matter relates to a Swiss Multinational company
Vestergaard Frandsen Group SA’s Indian subsidiary Vestergaard Frandsen
India Private limited. National RTI Forum, a Lucknow-based Civil society
working in the field of transparency in governance had written to the CVC
that Vestergaard Frandsen India have supplied “long lasting insecticide
treated bed net” (LLIN) to different states at completely varying rates.
LLIN is of great need in all humid, forest areas where mosquitoes are a
great menace and hence many States of India and many Public authorities
purchase these LLIN to be distributed among the poorest people. Vestergaard
Frandsen is among the largest suppliers of these LLIN in India as in many
other countries of the world.



Dr Nutan Thakur, convener of the RTI Forum has said that the company
supplies these LLIN either through direct contract or through some
intermediaries, but the rates of these LLIN in both the cases varies in a
very large range, which points to foul play. As per the records available,
the rate of supply of these LLIN has varied between Rs. 199 per unit to Rs.
400 per unit. No due process has been adopted in giving these orders. This
is completely unacceptable and shows the active connivance of the public
authorities and the company and its intermediaries in looting the State
fund in the name of free supply of mosquito nets to the poorest people. Dr
Thakur has also alleged that in case of Assam, the intermediary Ms Global
Business Services Private Limited, New Delhi purchased these LLIN from Ms
Vestergaard Frandsen India Private limited even before the date of contract
with the Assam government at much lower rate of Rs. 295 per unit and later
supplied the same to the Assam government at Rs. 400 per unit.



CVC, through its order has asked Dr Shukla to complete his investigation in
three months and hand over a report to the CVC.

Dr Nutan Thakur
#94155-34525

>>> import guess_language
>>> guess_language.guessLanguage(text)
Segmentation fault
"""

Version 0.2 (as on PyPI), using Debian 6.

Original issue reported on code.google.com by [email protected] on 30 Mar 2012 at 11:03
