thekindlyone / guess-language
Automatically exported from code.google.com/p/guess-language
License: GNU Lesser General Public License v2.1
What steps will reproduce the problem?
Unicode characters outside the Basic Multilingual Plane (ord(c) > 0xffff) cause an exception.
Traceback (most recent call last):
  File "describe-channels.py", line 20, in <module>
    lang = guess_language.guessLanguage(" ".join(row.get('text', [])))
  File "/usr/local/lib/python2.6/dist-packages/guess_language/guess_language.py", line 300, in guessLanguage
    return _identify(text, find_runs(text))
  File "/usr/local/lib/python2.6/dist-packages/guess_language/guess_language.py", line 352, in find_runs
    block = unicodeBlock(c)
  File "/usr/local/lib/python2.6/dist-packages/guess_language/blocks.py", line 64, in unicodeBlock
    return _names[ix]
IndexError: list index out of range
Original issue reported on code.google.com by [email protected]
on 17 Sep 2012 at 8:02
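One way to avoid this kind of IndexError is to make the block lookup return a default for code points the table does not cover. The sketch below is written for Python 3 and does not reproduce the library's blocks.py: the BLOCKS table is a hypothetical, abbreviated stand-in for Blocks.txt, and unicode_block is an illustrative name.

```python
import bisect

# Hypothetical, abbreviated block table: (start, end, name), sorted by start.
# The real Blocks.txt from the Unicode Character Database has many more rows.
BLOCKS = [
    (0x0000, 0x007F, "Basic Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x1F600, 0x1F64F, "Emoticons"),
]
_STARTS = [start for start, _end, _name in BLOCKS]


def unicode_block(ch, default=None):
    """Return the block name for one character, or default if none matches."""
    cp = ord(ch)
    # Index of the last block whose start is <= cp.
    ix = bisect.bisect_right(_STARTS, cp) - 1
    if ix < 0:
        return default
    _start, end, name = BLOCKS[ix]
    # The code point may sit in a gap after that block, or past the last block.
    return name if cp <= end else default
```

With this shape, a character beyond the last known block yields the default instead of raising, and the caller can decide whether to skip it.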
I'd like to draw your attention to
http://stackoverflow.com/questions/2164899/am-i-passing-the-string-correctly-to-the-python-library
Two suggestions:
(1) When detecting Japanese language, the code should ADD the scores for
Katakana and Hiragana and compare the TOTAL against a threshold.
(2) Point out in the documentation that not stripping
HTML/XML/Javascript/whatever out of the input can result in the answer
being heavily biased towards an ASCII-only-script language (typically English).
Original issue reported on code.google.com by [email protected]
on 5 Feb 2010 at 11:48
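Both suggestions can be sketched in a few lines of Python 3. This is an illustration only, not the library's code: looks_japanese, strip_markup, the script_share mapping, and the 0.4 threshold are all hypothetical names and values chosen for the example.

```python
from html.parser import HTMLParser


def looks_japanese(script_share, threshold=0.4):
    """Suggestion (1): compare the combined kana share against one threshold.

    script_share is an assumed mapping from script name to the fraction of
    characters in that script; the threshold value is arbitrary.
    """
    kana = script_share.get("Hiragana", 0.0) + script_share.get("Katakana", 0.0)
    return kana >= threshold


class _TextOnly(HTMLParser):
    """Collect text nodes, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.parts.append(data)


def strip_markup(html):
    """Suggestion (2): drop tags and scripts before guessing the language."""
    parser = _TextOnly()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace left behind by removed markup.
    return " ".join(" ".join(parser.parts).split())
```

Passing strip_markup(page) rather than the raw page to the guesser keeps ASCII tag names and script source from outweighing the actual text.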
What steps will reproduce the problem?
1. create 20 processes using multiprocess
2. in each of them: from guess_language import guessLanguage
3. make them run at the same time doing hard work;
What is the expected output? What do you see instead?
processes died around me like flies
What version of the product are you using? On what operating system?
Windows Vista 64-bit, Python 2.6.6
Please provide any additional information below.
Workaround
I made the problem disappear by doing:
from guess_language import *
instead
Original issue reported on code.google.com by [email protected]
on 16 Oct 2010 at 1:14
How about integrating this package with the Natural Language Toolkit?
http://nltk.googlecode.com/
Original issue reported on code.google.com by StevenBird1
on 23 Feb 2009 at 5:03
What steps will reproduce the problem?
1. Remove restrictions of size from the source code
2. Do: guess_language.guessLanguageName("hola como estas")
3. It returns "portuguese", instead of "spanish". "hola como estas" in
portuguese is "Olá, como vai você".
What is the expected output? What do you see instead?
Expected Spanish; I see Portuguese instead.
What version of the product are you using? On what operating system?
Latest from SVN.
Please provide any additional information below.
Original issue reported on code.google.com by [email protected]
on 26 Jul 2010 at 4:11
It would be nice if a) guess-language had a setup.py so one could make an
egg out of it and b) there was an egg uploaded to PyPI ('python setup.py
register sdist upload').
I've attached a minimal setup.py.
Original issue reported on code.google.com by wolfgang.schnerring
on 4 May 2010 at 10:16
What steps will reproduce the problem?
1. Read in a large file of varied UTF characters
2. Run guessLanguage on it
3. It takes forever
What is the expected output? What do you see instead?
What version of the product are you using? On what operating system?
Please provide any additional information below.
The library is designed to deal with small chunks of data, which is fine.
However, in the case you feed it lots of data, it slows to a crawl.
This appears to be because of the nonAlphaRe call in normalize; the regex is
thousands of characters long, and applied to every character in the data.
A substantial speedup (100x or more) can be obtained by replacing the following
call in normalize():
u = nonAlphaRe.sub(' ', u)
with
u = ''.join([ c.isalpha() and c or ' ' for c in u])
which I believe has the same effect.
Original issue reported on code.google.com by [email protected]
on 13 Jul 2011 at 1:18
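The proposed replacement can be written as a small standalone function. This is a Python 3 sketch of the idea from the report above, with blank_non_alpha as an illustrative name. It has not been checked against nonAlphaRe, so treat any claim of equivalence as an assumption. In particular, str.isalpha() is False for combining marks such as Devanagari vowel signs, so those would be replaced by spaces.

```python
def blank_non_alpha(text):
    """Replace every non-alphabetic character with a single space."""
    # One pass over the string; each character is tested independently,
    # so the cost grows linearly with the input length.
    return "".join(c if c.isalpha() else " " for c in text)
```

The conditional expression behaves the same as the `c.isalpha() and c or ' '` idiom for single characters and is the more common spelling in current Python.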
What steps will reproduce the problem?
CODE:
user_text = u"привет Мир!"
language = guessLanguage(user_text)
print language # expected: ru, actual: UNKNOWN
Original issue reported on code.google.com by anton.danilchenko
on 16 Jun 2012 at 10:32
Hello and hats off for the nice port to Python!
I have a problem regarding the trigrams (or their naming conventions). I
just had a text in Vietnamese from the feed below and it gets detected as
"ha", "tl" and some others.
The online perl version gives me however 100% Vietnamese and when using the
trigrams from the perl package, it is also correct. Could it be that you've
overwritten something or what not when shortening the names?
Cheers,
Martin
Feed link: http://www.tongti.net/?feed=rss2
Original issue reported on code.google.com by [email protected]
on 24 Aug 2008 at 8:51
"""
>>> print text
सीवीसी द्वारा मल्टीनेशनल पर
जांच के आदेश
केन्द्रीय सतर्कता आयोग ने
कोर्पोरेट भ्रष्टाचार के एक
प्रकरण में डॉ आर एस
शुक्ल, संयुक्त सचिव एवं मुख्य
सतर्कता अधिकारी, स्वास्थ्य
एवं परिवार कल्याण,
भारत सरकार को जांच करने के
आदेश दिए हैं. यह मामला स्विस
मल्टीनेशनल
वेस्टरगार्द फ्रैंडसेन ग्रुप
एसए के भारत स्थित सब्सिडीअरी
वेस्टरगार्द
फ्रैंडसेन इंडिया लिमिटेड से
सम्बंधित है. गवर्नेंस में
पारदर्शिता के क्षेत्र
में कार्यरत लखनऊ स्थित संस्था
नेशनल आरटीआई फोरम द्वारा
केन्द्रीय सतर्कता
आयोग को में पत्र लिखकर जांच और
कार्यवाही की मांग की गई थी.
संस्था की कन्वेनर डॉ नूतन
ठाकुर के मुताबिक, वेस्टरगार्द
फ्रैंडसेन इंडिया
द्वारा भारत में विभिन्न
राज्यों को काफी अलग-अलग दरों
पर “लंबी-आयु कीटनाशक
मच्छरदानी” (एलएलआइएन) प्रदान
किया गया है. एलएलआइएन का उपयोग
आर्द्रतायुक्त,
वन क्षेत्रों में होता है जहाँ
मच्छरों का भयावह प्रकोप होता
है. भारत के बहुत
सारे राज्य सरकारों और तमाम
अन्य सरकारी संस्थाओं द्वारा
गरीब लोगों को
निशुल्क वितरित करने के लिए
एलएलआइएन ख़रीदे जाते हैं.
वेस्टरगार्द फ्रैंडसेन
भारत एवं विश्व के कई देशों में
एलएलआइएन के सबसे बड़े
सप्लायरों में है.
यह कंपनी एलएलआइएन की सप्लाई
या तो सीधे करती है या अपने
मध्यस्थों के जरिये.
लेकिन इसके द्वारा सप्लाई किये
गए एलएलआइएन की कीमतों में
अलग-अलग जगह पर बहुत
भारी अंतर होता है जिससे यह
साफ़ झलकता है कि इनके रेट तय
करने में बेईमानी हुई
है. साथ ही इन दरों को तय करने
में आवश्यक प्रक्रिया का पालन
भी नहीं किया गया
है.
डॉ ठाकुर के अनुसार संस्था के
पास उपलब्ध दस्तावेजों के
अनुसार इनके दर रु०
199 प्रति ईकाई से रु० 400 तक नियत
किये गए. यह पूरी तरह गलत है और
दर्शाता है
कि किस प्रकार अत्यंत गरीब
लोगों को मच्छरदानी वितरित
करने के नाम पर
भ्रष्टाचार किया जा रहा है.
ज्ञातव्य हो कि असम, जहाँ
वेस्टरगार्द फ्रैंडसेन
द्वारा अपने मध्यस्थ मेसर्स
ग्लोबल बिजिनेस प्राइवेट
लिमिटेड, नयी दिल्ली के
माध्यम से रु० 400 प्रति
मच्छरदानी के दर से एलएलआइएन
सप्लाई किये गए हैं, के
मामले में उनके द्वारा
कॉन्ट्रेक्ट पाने के पूर्व ही
रुपये 295 प्रति ईकाई की
दर से वेस्टरगार्द फ्रैंडसेन
से उतनी ही मच्छरदानी की खरीद
का समझौता कर लिया
गया था, जो साफ़ दर्शाता है कि
इसमें असम सरकार के
अधिकारियों, वेस्टरगार्द
फ्रैंडसेन और ग्लोबल प्राइवेट
की मिलीभगत थी.
केन्द्रीय सतर्कता आयोग ने डॉ
शुक्ल को तीन महीने में जांच कर
रिपोर्ट देने के
आदेश दिए हैं.
The Central Vigilance Commission has handed over the investigation of
alleged high level corporate corruption to Dr R S Shukla, Joint Secretary
and Chief Vigilance Officer of the Ministry of Health and Family Welfare,
Government of India. The matter relates to a Swiss Multinational company
Vestergaard Frandsen Group SA’s Indian subsidiary Vestergaard Frandsen
India Private limited. National RTI Forum, a Lucknow-based Civil society
working in the field of transparency in governance had written to the CVC
that Vestergaard Frandsen India have supplied “long lasting insecticide
treated bed net” (LLIN) to different states at completely varying rates.
LLIN is of great need in all humid, forest areas where mosquitoes are a
great menace and hence many States of India and many Public authorities
purchase these LLIN to be distributed among the poorest people. Vestergaard
Frandsen is among the largest suppliers of these LLIN in India as in many
other countries of the world.
Dr Nutan Thakur, convener of the RTI Forum has said that the company
supplies these LLIN either through direct contract or through some
intermediaries, but the rates of these LLIN in both the cases varies in a
very large range, which points to foul play. As per the records available,
the rate of supply of these LLIN has varied between Rs. 199 per unit to Rs.
400 per unit. No due process has been adopted in giving these orders. This
is completely unacceptable and shows the active connivance of the public
authorities and the company and its intermediaries in looting the State
fund in the name of free supply of mosquito nets to the poorest people. Dr
Thakur has also alleged that in case of Assam, the intermediary Ms Global
Business Services Private Limited, New Delhi purchased these LLIN from Ms
Vestergaard Frandsen India Private limited even before the date of contract
with the Assam government at much lower rate of Rs. 295 per unit and later
supplied the same to the Assam government at Rs. 400 per unit.
CVC, through its order has asked Dr Shukla to complete his investigation in
three months and hand over a report to the CVC.
Dr Nutan Thakur
#94155-34525
>>> import guess_language
>>> guess_language.guessLanguage(text)
Segmentation fault
"""
Version 0.2 (as on PyPI), using Debian 6.
Original issue reported on code.google.com by [email protected]
on 30 Mar 2012 at 11:03
If the module is compressed in a zip/egg file, "Blocks.txt" is not found and
"_loadBlocks" raises IOError.
Original issue reported on code.google.com by [email protected]
on 27 Apr 2012 at 2:34
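A common way to read a data file that may live inside a zip or egg is the standard library's pkgutil.get_data, which asks the package's loader for the bytes instead of opening a filesystem path. The following is a sketch of that approach, not the library's _loadBlocks: load_blocks_text is a hypothetical helper, and it assumes Blocks.txt sits next to the package's modules.

```python
import pkgutil


def load_blocks_text(package="guess_language", resource="Blocks.txt"):
    """Return the text of a data file shipped inside a package.

    Works for packages on disk and for zip/egg imports, as long as the
    package's loader supports get_data().
    """
    data = pkgutil.get_data(package, resource)
    if data is None:
        # get_data() returns None when the loader cannot serve resources.
        raise IOError("cannot load %s from package %s" % (resource, package))
    return data.decode("utf-8")
```

The returned string can then be parsed line by line in the same way as the contents of an opened file.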