
Comments (4)

codelucas commented on July 20, 2024

Hello,
I doubt that this issue is related to python-goose. Does the error also occur
when you run goose?

My guess is that this is a permissions problem: you may not have
permission to read the file .../text/stopwords-tr.txt. I'm not entirely sure, though.
Could you please paste the entire stack trace?

Note:
https://github.com/codelucas/newspaper/blob/master/newspaper/text.py#L67

from newspaper.

karls commented on July 20, 2024

Hi @00krishna and @codelucas

I actually ran into the same issue just today and dug a bit deeper.

It turns out that the default configuration takes the language from the article's source (HTML) and uses it as is. So if the source happens to contain <html lang="uk">, the language is set to uk, which is what happened in my case.

The OutputFormatter class tries to remove paragraphs with only a few words (via its remove_fewwords_paragraphs method), and while doing so it tries to load the stopwords file for the language extracted from the article's source. In the uk case, it tries to read /Users/karl/.virtualenvs/juicer-mk2/lib/python2.7/site-packages/newspaper/utils/../resources/text/stopwords-uk.txt, which doesn't exist. Similarly, as @00krishna found out, stopwords-tr.txt doesn't exist either.

I'd say it's nobody's fault. A webpage can declare whatever garbage it wants as its language, and newspaper will just try to look it up. I think a good solution would be to map "things that are English" (like uk, en, gb) to just en, and to do the same for all the other languages.
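To make the mapping idea concrete, here is a minimal sketch. The names LANGUAGE_ALIASES and normalize_language are hypothetical and are not part of newspaper; the alias entries simply mirror the examples given above. One caveat: uk is also the ISO 639-1 code for Ukrainian, so a real table would need to decide how to treat it.

```python
# Hypothetical alias table, NOT newspaper's actual mapping.
# Entries mirror the examples from this thread (uk, gb, en -> en).
LANGUAGE_ALIASES = {
    "en": "en",
    "gb": "en",
    "uk": "en",  # assumption from the thread; "uk" is also ISO 639-1 Ukrainian
}


def normalize_language(code, default="en"):
    """Map a raw <html lang="..."> value to a canonical language code."""
    if not code:
        return default
    # Reduce values such as "en-GB" or "en_US" to their primary subtag.
    primary = code.strip().lower().replace("_", "-").split("-")[0]
    # Unknown codes pass through unchanged so the caller can still
    # decide what to do when no stopwords file exists for them.
    return LANGUAGE_ALIASES.get(primary, primary)
```

With this sketch, normalize_language("uk") and normalize_language("en-GB") both yield en, while an unmapped code such as tr is returned unchanged.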


codelucas commented on July 20, 2024

@karls Very good idea. I'll add an "English language mapping". It would also be a good idea to catch the exception raised when the stopwords file for a language code is missing, and just log the error instead of halting all of newspaper.
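A rough sketch of what catching the missing file and logging could look like. This is not the actual patch: load_stopwords and its resource_dir parameter are hypothetical names, and only the stopwords-<lang>.txt file naming is taken from the paths quoted in this thread.

```python
import logging
import os

log = logging.getLogger(__name__)


def load_stopwords(resource_dir, language):
    """Return the stopwords for `language`, or an empty set if none exist.

    Hypothetical helper: `resource_dir` stands in for the directory that
    holds the stopwords-<lang>.txt files.
    """
    path = os.path.join(resource_dir, "stopwords-%s.txt" % language)
    try:
        with open(path) as f:
            # One stopword per line; skip blank lines.
            return set(line.strip() for line in f if line.strip())
    except IOError:
        # Missing or unreadable file: log it and carry on instead of
        # letting the exception abort the whole extraction.
        log.warning("No stopwords file for language %r at %s", language, path)
        return set()
```

Returning an empty set means the caller keeps running for an unknown language; whether that is the right fallback (rather than, say, falling back to English) is a design choice this sketch leaves open.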


codelucas commented on July 20, 2024

This issue has been patched.
Ref: 246ad93

