multi_rake's Introduction

Multilingual Rapid Automatic Keyword Extraction (RAKE) for Python

Features

  • Automatic keyword extraction from text written in any language
  • No need to know the language of the text beforehand
  • No need to have a list of stopwords
  • 27 languages have built-in stopwords (see the list below); for the rest, stopwords are generated from the provided text
  • Just configure rake, plug in text, and get keywords (see Implementation Details)

Installation

pip install multi-rake

If installation fails with a cld2 compile error about narrowing conversions, then it can be installed with

CFLAGS="-Wno-narrowing" pip install multi-rake

Examples

English text. We specify neither the language nor a list of stopwords, so the built-in list is used.

from multi_rake import Rake

text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.'
)

rake = Rake()

keywords = rake.apply(text_en)

print(keywords[:10])

# [('minimal generating sets', 8.666666666666666),
#  ('linear diophantine equations', 8.5),
#  ('minimal supporting set', 7.666666666666666),
#  ('minimal set', 4.666666666666666),
#  ('linear constraints', 4.5),
#  ('natural numbers', 4.0),
#  ('strict inequations', 4.0),
#  ('nonstrict inequations', 4.0),
#  ('upper bounds', 4.0),
#  ('mixed types', 3.666666666666667)]

Text written in Esperanto (an article about liberalism). There is no built-in list of stopwords for this language, so they will be generated from the provided text.

text consists of the first three paragraphs of the introduction. other_text, passed as text_for_stopwords, is the rest of the article (not shown here).

text = (
    'Liberalismo estas politika filozofio aŭ mondrigardo konstruita en '
    'ideoj de libereco kaj egaleco. Liberaluloj apogas larĝan aron de '
    'vidpunktoj depende de sia kompreno de tiuj principoj, sed ĝenerale '
    'ili apogas ideojn kiel ekzemple liberaj kaj justaj elektoj, '
    'civitanrajtoj, gazetara libereco, religia libereco, libera komerco, '
    'kaj privata posedrajto. Liberalismo unue iĝis klara politika movado '
    'dum la Klerismo, kiam ĝi iĝis populara inter filozofoj kaj '
    'ekonomikistoj en la okcidenta mondo. Liberalismo malaprobis heredajn '
    'privilegiojn, ŝtatan religion, absolutan monarkion kaj la Didevena '
    'Rajto de Reĝoj. La filozofo John Locke de la 17-a jarcento ofte '
    'estas meritigita pro fondado de liberalismo kiel klara filozofia '
    'tradicio. Locke argumentis ke ĉiu homo havas naturon rekte al vivo, '
    'libereco kaj posedrajto kaj laŭ la socia '
    'kontrakto, registaroj ne rajtas malobservi tiujn rajtojn. '
    'Liberaluloj kontraŭbatalis tradician konservativismon kaj serĉis '
    'anstataŭigi absolutismon en registaroj per reprezenta demokratio kaj '
    'la jura hegemonio.'
)

rake = Rake(max_words_unknown_lang=3)

keywords = rake.apply(text, text_for_stopwords=other_text)

print(keywords)

# [('serĉis anstataŭigi absolutismon', 9.0),  # sought to replace absolutism
#  ('filozofo john locke', 8.5),  # philosopher John Locke
#  ('locke argumentis', 4.5),  # Locke argued
#  ('justaj elektoj', 4.0),  # fair elections
#  ('libera komerco', 4.0),  # free trade
#  ('okcidenta mondo', 4.0),  # western world
#  ('ŝtatan religion', 4.0),  # state religion
#  ('absolutan monarkion', 4.0),  # absolute monarchy
#  ('didevena rajto', 4.0),  # divine right
#  ('socia kontrakto', 4.0),  # social contract
#  ('jura hegemonio', 4.0),  # legal hegemony
#  ('mondrigardo konstruita', 4.0),  # worldview built
#  ('vidpunktoj depende', 4.0),  # viewpoints depending
#  ('sia kompreno', 4.0),  # their understanding
#  ('tiuj principoj', 4.0),  # these principles
#  ('gazetara libereco', 3.5),  # freedom of the press
#  ('religia libereco', 3.5),  # religious freedom
#  ('privata posedrajto', 3.5),  # private property
#  ('libereco', 1.5),  # liberty
#  ('posedrajto', 1.5)]  # property

So we are able to get a decent result without an explicit set of stopwords.

Usage

Initialize rake object

from multi_rake import Rake

rake = Rake(
    min_chars=3,
    max_words=3,
    min_freq=1,
    language_code=None,  # 'en'
    stopwords=None,  # {'and', 'of'}
    lang_detect_threshold=50,
    max_words_unknown_lang=2,
    generated_stopwords_percentile=80,
    generated_stopwords_max_len=3,
    generated_stopwords_min_freq=2,
)

min_chars - a word is selected to be part of a keyword if its length is >= min_chars. Default 3

max_words - maximum number of words in a phrase considered to be a keyword. Default 3

min_freq - minimum number of occurrences of a phrase for it to be considered a keyword. Default 1

language_code - provide a language code as a string to use the built-in set of stopwords. See List of Currently Available Languages. If the language is not specified, the algorithm will try to determine it with cld2 and use the corresponding set of built-in stopwords. Default None

stopwords - provide your own collection of stopwords (preferably as a lowercased set). Overrides language_code if it was specified. Default None

If language_code and stopwords are both None and the detected language has no built-in stopwords, stopwords will be generated from the provided text.

lang_detect_threshold - threshold for the probability of the detected language in cld2 (0-100). Default 50

max_words_unknown_lang - the same as max_words, but used when the language is unknown and stopwords are generated from the provided text. The best result is usually obtained with a specifically crafted set of stopwords. With generated stopwords the resulting keywords may not be as clean, so it may be a good idea, for example, to produce 2-word keywords for unknown languages and 3-word keywords for languages with predefined sets of stopwords. Default 2

generated_stopwords_percentile - to generate stopwords, we build a frequency distribution of the words in the text. Words above this percentile (0-100) are considered candidates to become stopwords. Default 80

generated_stopwords_max_len - maximum character length of generated stopwords. Default 3

generated_stopwords_min_freq - minimum frequency of generated stopwords in the distribution. Default 2
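To illustrate how lang_detect_threshold might gate language detection, here is a minimal sketch. It does not call cld2. Instead, the hypothetical helper pick_language takes a details tuple shaped like the third value returned by pycld2.detect, that is, entries of (name, code, percent, score). Comparing the percent field against the threshold is an assumption about how the library applies this parameter.

```python
def pick_language(details, threshold=50):
    """Return a language code, or None if detection is not confident enough.

    `details` mimics the third value returned by pycld2.detect:
    a tuple of (language_name, language_code, percent, score) entries.
    This helper is hypothetical and only illustrates the thresholding idea.
    """
    if not details:
        return None
    _name, code, percent, _score = details[0]
    # 'un' is the code cld2 reports for an unknown language.
    if code == 'un' or percent <= threshold:
        return None
    return code


print(pick_language((('DUTCH', 'nl', 98, 1200.0),)))
print(pick_language((('ENGLISH', 'en', 40, 500.0),), threshold=50))
```

When the helper returns None, the language is treated as unknown, which is the case where generated stopwords and max_words_unknown_lang come into play.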


Apply rake object to text.

keywords = rake.apply(
    text,
    text_for_stopwords=None,
)

text - string containing text from which keywords should be generated.

text_for_stopwords - string containing text which will be used for stopwords generation alongside text. For example, you have an article with an introduction and several subsections. You know that keywords from the introduction will suffice for your purposes, but you don't know the language of the text and you don't have a list of stopwords. Stopwords can then be generated from the text itself, and the more text you have, the better. In that case you may specify text=introduction, text_for_stopwords=rest_of_your_text.
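A small sketch of how an article could be split into the two arguments. The helper name split_for_rake and the assumption that paragraphs are separated by blank lines are illustrative only and are not part of multi_rake.

```python
def split_for_rake(article, intro_paragraphs=3):
    """Split an article into (text, text_for_stopwords).

    Hypothetical helper: assumes paragraphs are separated by blank lines.
    Returns None as the second value when there is no remaining text.
    """
    paragraphs = [p.strip() for p in article.split('\n\n') if p.strip()]
    text = '\n\n'.join(paragraphs[:intro_paragraphs])
    rest = '\n\n'.join(paragraphs[intro_paragraphs:])
    return text, rest or None


article = 'Intro one.\n\nIntro two.\n\nSection A.\n\nSection B.'
text, other_text = split_for_rake(article, intro_paragraphs=2)
print(text)
print(other_text)
```

The two values can then be passed as rake.apply(text, text_for_stopwords=other_text).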

Implementation Details

RAKE algorithm works as described in Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons
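As a rough illustration of the scoring described in that paper, the sketch below splits text into candidate phrases at punctuation and stopwords, scores each word as degree divided by frequency, and scores a phrase as the sum of its word scores. The function names, the punctuation pattern, and the tiny stopword set are assumptions made for this example. This is not the code of multi_rake itself.

```python
import re
from collections import defaultdict


def candidate_phrases(text, stopwords):
    """Split text into candidate phrases at punctuation and stopwords."""
    phrases = []
    for fragment in re.split(r'[.,;:!?()\[\]"]+', text.lower()):
        current = []
        for word in fragment.split():
            if word in stopwords:
                # A stopword ends the current candidate phrase.
                if current:
                    phrases.append(current)
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(current)
    return phrases


def rake_scores(phrases):
    """Score phrases: word score = degree / frequency, phrase score = sum."""
    freq = defaultdict(int)
    degree = defaultdict(int)
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            # Degree counts co-occurring words in the phrase, including the word itself.
            degree[word] += len(phrase)
    word_score = {w: degree[w] / freq[w] for w in freq}
    scored = {' '.join(p): sum(word_score[w] for w in p) for p in phrases}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


phrases = candidate_phrases(
    'Linear constraints over the set of natural numbers.',
    {'over', 'the', 'of'},
)
print(rake_scores(phrases))
```

Longer phrases made of words that tend to co-occur receive higher scores, which is why multi-word keywords rank above single words in the examples.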

This implementation differs from others in its multilingual support. You may provide text without knowing its language (it should be written in the Cyrillic or Latin alphabet) and without an explicit list of stopwords, and still get a decent result. The best result, though, is achieved with a thoroughly constructed list of stopwords.

What is happening under the hood:

  1. if stopwords are specified, they will be used
  2. if a language is specified, the built-in stopwords for this language will be used; if there are no built-in stopwords, go to step 4
  3. if a language is not specified, cld2 will try to determine the language, then go to step 2
  4. stopwords are generated from text and text_for_stopwords
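The decision order above can be sketched as a small pure function. The name choose_stopword_source, the detected_code argument (standing in for whatever cld2 returns, or None when detection fails), and the two-entry builtin_codes set are all hypothetical and used only for illustration.

```python
def choose_stopword_source(stopwords=None, language_code=None,
                           detected_code=None,
                           builtin_codes=frozenset({'en', 'ru'})):
    """Return 'custom', 'builtin:<code>', or 'generated'.

    Hypothetical helper mirroring steps 1-4; `detected_code` stands in for
    the result of language detection and `builtin_codes` for the languages
    that ship with stopword lists.
    """
    # Step 1: explicit stopwords always win.
    if stopwords is not None:
        return 'custom'
    # Steps 2 and 3: use the given language, otherwise the detected one.
    code = language_code if language_code is not None else detected_code
    if code is not None and code in builtin_codes:
        return 'builtin:' + code
    # Step 4: fall back to stopwords generated from the text.
    return 'generated'


print(choose_stopword_source(stopwords={'and', 'of'}, language_code='en'))
print(choose_stopword_source(language_code='eo'))
print(choose_stopword_source(detected_code='ru'))
```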

We generate stopwords by creating a frequency distribution of the words in the text and filtering it with the parameters generated_stopwords_percentile, generated_stopwords_max_len, and generated_stopwords_min_freq. We won't be able to generate them perfectly, but it is rather easy to find articles and prepositions, because they usually consist of 3-4 characters and appear frequently. These stopwords, coupled with punctuation delimiters, enable us to get decent results for languages we don't understand.
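A minimal sketch of this idea, under stated assumptions: words are extracted with a simple regular expression, the percentile is computed with the nearest-rank method, and a word at or above the threshold counts as a candidate. The library may tokenize and interpolate differently, so treat generate_stopwords as a hypothetical illustration rather than its actual code.

```python
import math
import re
from collections import Counter


def generate_stopwords(text, percentile=80, max_len=3, min_freq=2):
    """Pick short, frequent words as stopword candidates (illustrative only)."""
    words = re.findall(r'\w+', text.lower())
    counts = Counter(words)
    if not counts:
        return set()
    freqs = sorted(counts.values())
    # Nearest-rank percentile over per-word frequencies (an assumption).
    rank = max(1, math.ceil(percentile / 100 * len(freqs)))
    threshold = freqs[rank - 1]
    return {
        word for word, freq in counts.items()
        if freq >= threshold and freq >= min_freq and len(word) <= max_len
    }


sample = 'la kato kaj la hundo kaj la birdo en la domo de la viro'
print(generate_stopwords(sample))
```

In this sample only the short words that repeat ("la" and "kaj") pass all three filters. Short words that occur once, such as "en" and "de", are missed, which shows why more text for stopword generation helps.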

List of Currently Available Languages

Only the language code should be passed when initializing Rake.

  • bg - Bulgarian
  • cs - Czech
  • da - Danish
  • de - German
  • el - Greek
  • en - English
  • es - Spanish
  • fa - Persian
  • fi - Finnish
  • fr - French
  • ga - Irish
  • hr - Croatian
  • hu - Hungarian
  • id - Indonesian
  • it - Italian
  • lt - Lithuanian
  • lv - Latvian
  • nl - Dutch
  • no - Norwegian
  • pl - Polish
  • pt - Portuguese
  • ro - Romanian
  • ru - Russian
  • sk - Slovak
  • sv - Swedish
  • tr - Turkish
  • uk - Ukrainian

Development

The repository has a configured linter, tests, and coverage.

Create a new virtual environment inside the multi_rake folder in order to use them.

python3 -m venv env
source env/bin/activate

make install-dev  # install dependencies

make lint  # run linter

make test  # run tests and coverage

References

RAKE algorithm: Rose, S., Engel, D., Cramer, N., & Cowley, W. (2010). Automatic Keyword Extraction from Individual Documents. In M. W. Berry & J. Kogan (Eds.), Text Mining: Theory and Applications: John Wiley & Sons

The RAKE implementation by fabianvf was used as a basis.

Stopwords: trec-kba, Ranks NL

multi_rake's Issues

Error installing with pip

I got the following errors when trying to install it with pip. Maybe they are not really related to multi-rake but to the pycld2 library, but just in case...

  x86_64-linux-gnu-gcc -Wno-unused-result -Wsign-compare -DNDEBUG -g -fwrapv -O2 -Wall -g -fstack-protector-strong -Wformat -Werror=format-security -g -fwrapv -O2 -fPIC -I/tmp/pip-install-eru2tbk6/pycld2_afe9ac3933ed4ef2b826dde420fc878d/cld2/internal -I/tmp/pip-install-eru2tbk6/pycld2_afe9ac3933ed4ef2b826dde420fc878d/cld2/public -I/root/alike-social/env/include -I/usr/include/python3.10 -c /tmp/pip-install-eru2tbk6/pycld2_afe9ac3933ed4ef2b826dde420fc878d/bindings/encodings.cc -o build/temp.linux-x86_64-cpython-310/tmp/pip-install-eru2tbk6/pycld2_afe9ac3933ed4ef2b826dde420fc878d/bindings/encodings.o -w -O2 -m64 -fPIC
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pycld2

Installing using the second command CFLAGS="-Wno-narrowing" pip install multi-rake gives another error:

wing -fPIC -I/tmp/pip-install-rh8ea47v/pycld2_4d9ed0a197634cc68c9a3c2c437ba00a/cld2/internal -I/tmp/pip-install-rh8ea47v/pycld2_4d9ed0a197634cc68c9a3c2c437ba00a/cld2/public -I/root/alike-social/env/include -I/usr/include/python3.10 -c /tmp/pip-install-rh8ea47v/pycld2_4d9ed0a197634cc68c9a3c2c437ba00a/bindings/encodings.cc -o build/temp.linux-x86_64-cpython-310/tmp/pip-install-rh8ea47v/pycld2_4d9ed0a197634cc68c9a3c2c437ba00a/bindings/encodings.o -w -O2 -m64 -fPIC
      error: command 'x86_64-linux-gnu-gcc' failed: No such file or directory
      [end of output]

  note: This error originates from a subprocess, and is likely not a problem with pip.
error: legacy-install-failure

× Encountered error while trying to install package.
╰─> pycld2

System:

  • Ubuntu SMP Sat May 21 02:24:07 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
  • Python 3.10.4

Steps to reproduce:

pip install multi-rake # will return the first error
CFLAGS="-Wno-narrowing" pip install multi-rake # will return the second error

Installation issue

Hi!

I am interested in using your tool, but when I try to install it, I receive this error:

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/setup.py", line 191, in <module>
    'Topic :: Text Processing :: Linguistic'
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/__init__.py", line 144, in setup
    return distutils.core.setup(**attrs)
  File "/usr/lib/python3.6/distutils/core.py", line 148, in setup
    dist.run_commands()
  File "/usr/lib/python3.6/distutils/dist.py", line 955, in run_commands
    self.run_command(cmd)
  File "/usr/lib/python3.6/distutils/dist.py", line 974, in run_command
    cmd_obj.run()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 297, in run
    self.find_sources()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 304, in find_sources
    mm.run()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 535, in run
    self.add_defaults()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/egg_info.py", line 571, in add_defaults
    sdist.add_defaults(self)
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/py36compat.py", line 34, in add_defaults
    self._add_defaults_python()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/sdist.py", line 135, in _add_defaults_python
    build_py = self.get_finalized_command('build_py')
  File "/usr/lib/python3.6/distutils/cmd.py", line 299, in get_finalized_command
    cmd_obj.ensure_finalized()
  File "/usr/lib/python3.6/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/home/francisco/.local/share/virtualenvs/rake-MKMDuRw-/lib/python3.6/site-packages/setuptools/command/build_py.py", line 34, in finalize_options
    orig.build_py.finalize_options(self)
  File "/usr/lib/python3.6/distutils/command/build_py.py", line 45, in finalize_options
    ('force', 'force'))
  File "/usr/lib/python3.6/distutils/cmd.py", line 287, in set_undefined_options
    src_cmd_obj.ensure_finalized()
  File "/usr/lib/python3.6/distutils/cmd.py", line 107, in ensure_finalized
    self.finalize_options()
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/setup.py", line 143, in finalize_options
    self.distribution.ext_modules = get_ext_modules()
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/setup.py", line 128, in get_ext_modules
    import cld2
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/cld2/__init__.py", line 190, in <module>
    extra_compile_args=_COMPILER_ARGS)
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/.eggs/cffi-1.14.0-py3.6-linux-x86_64.egg/cffi/api.py", line 468, in verify
    lib = self.verifier.load_library()
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/.eggs/cffi-1.14.0-py3.6-linux-x86_64.egg/cffi/verifier.py", line 104, in load_library
    self._compile_module()
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/.eggs/cffi-1.14.0-py3.6-linux-x86_64.egg/cffi/verifier.py", line 201, in _compile_module
    outputfilename = ffiplatform.compile(tmpdir, self.get_extension())
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/.eggs/cffi-1.14.0-py3.6-linux-x86_64.egg/cffi/ffiplatform.py", line 22, in compile
    outputfilename = _build(tmpdir, ext, compiler_verbose, debug)
  File "/tmp/pip-install-uu14tt9z/cld2-cffi/.eggs/cffi-1.14.0-py3.6-linux-x86_64.egg/cffi/ffiplatform.py", line 58, in _build
    raise VerificationError('%s: %s' % (e.__class__.__name__, e))
cffi.VerificationError: CompileError: command 'x86_64-linux-gnu-gcc' failed with exit status 1
----------------------------------------

ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Why are "last" and "new" stopwords while "past" is not?

I ran the following texts through Rake (default settings); the outputs are included below. "new" and "last" are not included in the output, but "past" is. I just wonder how Rake makes this decision.

Text: "new accounts opened in last 6 months" --> Rake output: "accounts opened","6 months"
Text: "number of contacts in past 12 months" --> Rake output: "contacts","past 12 months","number"

AttributeError: 'str' object has no attribute 'isnumeric'

I use Python 2.7. This is the code:

import pytest
from multi_rake import Rake

rake = Rake(
    min_chars=3,
    max_words=3,
    min_freq=1,
    lang_detect_threshold=50,
    max_words_unknown_lang=2,
    generated_stopwords_percentile=80,
    generated_stopwords_max_len=3,
    generated_stopwords_min_freq=2,
)

text_en = (
    'Compatibility of systems of linear constraints over the set of '
    'natural numbers. Criteria of compatibility of a system of linear '
    'Diophantine equations, strict inequations, and nonstrict inequations '
    'are considered. Upper bounds for components of a minimal set of '
    'solutions and algorithms of construction of minimal generating sets '
    'of solutions for all types of systems are given. These criteria and '
    'the corresponding algorithms for constructing a minimal supporting '
    'set of solutions can be used in solving all the considered types of '
    'systems and systems of mixed types.'
)

result = rake.apply(text_en)


AttributeError Traceback (most recent call last)
in ()
----> 1 result = rake.apply(text_en)

/usr/local/lib/python2.7/dist-packages/multi_rake-0.0.1-py2.7.egg/multi_rake/algorithm.pyc in apply(self, text, text_for_stopwords)
85 )
86
---> 87 word_scores = Rake._calculate_word_scores(phrase_list)
88
89 keywords = self._generate_candidate_keyword_scores(

/usr/local/lib/python2.7/dist-packages/multi_rake-0.0.1-py2.7.egg/multi_rake/algorithm.pyc in _calculate_word_scores(phrase_list)
184
185 for phrase in phrase_list:
--> 186 word_list = separate_words(phrase)
187 word_list_length = len(word_list)
188 word_list_degree = word_list_length - 1

/usr/local/lib/python2.7/dist-packages/multi_rake-0.0.1-py2.7.egg/multi_rake/utils.pyc in separate_words(text)
29
30 for word in text.split():
---> 31 if not word.isnumeric():
32 words.append(word)
33

AttributeError: 'str' object has no attribute 'isnumeric'

Multi-rake installation failed

I am trying to use multi-rake in a Google Colab notebook. I keep getting this error:
Using cached https://files.pythonhosted.org/packages/52/6d/044789e730141bcda2a7368836f714684a7d13bd44a2a33b387cb31b4335/cld2-cffi-0.1.4.tar.gz
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I tried running pip install cld2-cffi and it gives the error:
Collecting cld2-cffi
Using cached https://files.pythonhosted.org/packages/52/6d/044789e730141bcda2a7368836f714684a7d13bd44a2a33b387cb31b4335/cld2-cffi-0.1.4.tar.gz
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Segmentation fault from rake.apply function

Hello everyone,

I wanted to use the multi_rake keyword extractor. However, my code continuously shuts down because of a 'segmentation fault', which seems to be linked to the line "keywords = rake.apply(text=text)".

I created a class that uses the rake extractor and then wanted to use that class on a small Dutch text:

from multi_rake import Rake

class RakeKeywordExtractor():

    def __init__(self):
        # These are the default values, but we might want to adapt them!
        self.rake = Rake()

    def get_keywords(self, text, limit=None):
        if limit:
            keywords = self.rake.apply(text=text)
            return keywords[:limit]
        
        else:
            return self.rake.apply(text=text)
        
keyword_extractor = RakeKeywordExtractor()


tekst = """
De oorzaak van aften is niet bekend. We denken dat ze makkelijker ontstaan bij 1 of meer van deze dingen:

kleine wondjes in uw mond, bijvoorbeeld door:
bijten op uw wang
tandenpoetsen of flossen
een kunstgebit dat niet goed past
droge mond
stress
veranderingen in hormonen, bijvoorbeeld door ongesteld zijn of zwanger zijn
erfelijke aanleg: dit betekent dat veel mensen in uw familie aften hebben
heel soms bij te weinig ijzer, vitamine B12, of foliumzuur in uw bloed.
heel soms zijn aften een bijwerking van medicijnen
Bijvoorbeeld van sterke pijnstillers (fentanyl) of medicijnen bij kanker.
Er is geen bewijs dat deze dingen aften veroorzaken.
"""

keywords = keyword_extractor.get_keywords(tekst)
print("These are the keywords:")
for keyword in keywords:
    print(keyword)

I enabled the fault handler to get more information about the segmentation fault and then got this:

Fatal Python error: Segmentation fault

Current thread 0x00000001ddd42080 (most recent call first):
  File "...venv/lib/python3.11/site-packages/multi_rake/utils.py", line 14 in detect_language
  File "...venv/lib/python3.11/site-packages/multi_rake/algorithm.py", line 62 in apply
  File "...backend/src/services/keyword_extraction/rake.py", line 18 in get_keywords
  File "...backend/src/services/keyword_extraction/rake.py", line 40 in <module>

Extension modules: numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, pvectorc, pycld2._pycld2, regex._regex (total: 16)
[1]    12942 segmentation fault  venv/bin/python -Xfaulthandler 

The error seems to be linked to the detect_language function in multi_rake/utils.py.

Does anybody maybe know what is causing this segmentation error and how I can resolve it?

Thank you!

Kind regards,

Birgit Bartels

Empty list returned when working with Devanagari Script

Hi, I'm working with texts in Devanagari script (a popular script used in India, unlike the Latin script used by English-like languages). When I try to generate keywords, it returns an empty list. The code is below.

full_text="शेवणें आनी शेतकार एक आसलेलो शेतकार तेणें बरें शेत रोयलेलें रोयल्यार कितें जालें थाम वाडलें आनी इल्लें इल्लें करून पोटराक येयलें आनी थोडे दीस वयतकच कुचकुचीत गोट्याचें कणस सुटलें आनी वाऱ्याचेर बरें धोलूंक लागलें शेतकाराक सामकी उमेद जाली आतां म्हण लागलो रोकडेंच आपूण शेत लुंवतलो आनी भात घरा व्ह"

rake = Rake(max_words_unknown_lang=1)

keywords = rake.apply(full_text)

error: input contains invalid UTF-8 around byte XXXX

I am trying to extract keywords from the amazon_reviews dataset. When using it for Spanish, I encounter this error, which I am unable to resolve.

STACK TRACE
/python3.8/site-packages/multi_rake/algorithm.py in apply(self, text, text_for_stopwords)
     60 
     61         else:
---> 62             language_code = detect_language(text, self.lang_detect_threshold)
     63 
     64             if language_code is not None and language_code in STOPWORDS:

/opt/conda/lib/python3.8/site-packages/multi_rake/utils.py in detect_language(text, proba_threshold)
     12 
     13 def detect_language(text, proba_threshold):
---> 14     _, _, details = pycld2.detect(text)
     15 
     16     language_code = details[0][1]

error: input contains invalid UTF-8 around byte 2094 (of 5341)

Is there a workaround, such as manually specifying the language code?

pip install multi-rake gives this error: Command python setup.py egg_info failed with error code 1 in /tmp/pip-install-pvx6l20l/cld2-cffi/

Cannot install multi-rake
pip install multi-rake gives this error:
Command python setup.py egg_info failed with error code 1 in /tmp/pip-install-pvx6l20l/cld2-cffi/

Environment: Google Colab

Collecting multi-rake
  Using cached https://files.pythonhosted.org/packages/5f/26/8d1fd22c5e1bf65936bae9f8201df2fa8cefe6fa0b28f471384e8101b298/multi_rake-0.0.1-py3-none-any.whl
Collecting cld2-cffi>=0.1.4 (from multi-rake)
  Using cached https://files.pythonhosted.org/packages/52/6d/044789e730141bcda2a7368836f714684a7d13bd44a2a33b387cb31b4335/cld2-cffi-0.1.4.tar.gz
Command "python setup.py egg_info" failed with error code 1 in /tmp/pip-install-pvx6l20l/cld2-cffi/
