
news-please's People

Contributors

abidibo, ahacad, amjltc295, anteverse, arcolife, cookieshake, donglixp, eynand, fhamborg, frankier, fshafalir, jkawamoto, lgov, loganamcnichols, medno, megatron-me-uk, moyid, mxab, ntlf, petlack, phdowling, sebastian-nagel, shangw-nvidia, shradhasehgal, somnathrakshit, stepinsilence, t1h0, thihara, tsoernes, yldoctrine

news-please's Issues

news-please: CLI issue

I am new to Python and have a fresh install of Python 3.6.4 on a 64-bit Windows 8.1 laptop. I have installed elasticsearch, newspaper3k, and news-please using pip3. I am able to use the library commands, but I need the CLI to gather articles from news sites. I've run the CLI, but I keep getting the error shown in the attached image. I get this error with or without admin access for the command window. I've checked the config files as mentioned in the wiki. What am I missing?

Thank you very much for your help with this!

image

Version conflict on python-dateutil

Hi there,
I'm currently running Python 2.7 on Ubuntu 14.04 and installed news-please using pip.
I tried running news-please from the CLI and got this error:

Traceback (most recent call last):
  File "/usr/local/bin/news-please", line 6, in <module>
    from pkg_resources import load_entry_point
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 3147, in <module>
    @_call_aside
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 3131, in _call_aside
    f(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 3160, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 668, in _build_master
    return cls._build_from_requirements(requires)
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 681, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/local/lib/python2.7/dist-packages/pkg_resources/__init__.py", line 875, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (python-dateutil 2.6.1 (/usr/local/lib/python2.7/dist-packages), Requirement.parse('python-dateutil==2.4.0'), set(['newspaper']))

lxml version requirements

I get an error installing news-please using pip:
No matching distribution found for lxml>=3.35 (from news-please)

It looks like the version number comparison should be 3.3.5 instead of 3.35.
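
The difference between the two specifiers shows up when the dotted segments are compared as numbers, which is roughly how pip orders release versions. The following is a minimal sketch of that comparison; `parse_version` and `satisfies_minimum` are hypothetical helper names, and the release number 3.8.0 is only an illustrative assumption for a 3.x release.

```python
def parse_version(text):
    """Split a dotted version string into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def satisfies_minimum(candidate, minimum):
    """Return True if candidate >= minimum, comparing segment by segment."""
    a, b = parse_version(candidate), parse_version(minimum)
    # Pad the shorter tuple with zeros so that 3.35 compares like 3.35.0.
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


# 3.8.0 does not satisfy ">=3.35" because the second segment 8 is less than 35,
# but it does satisfy ">=3.3.5".
print(satisfies_minimum("3.8.0", "3.35"))
print(satisfies_minimum("3.8.0", "3.3.5"))
```

Under this ordering, `>=3.35` excludes every 3.x release whose second segment is below 35, which would explain why pip finds no matching distribution.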

Re-enable MySQL

I'm working on configuring news-please to use MySQL for persistent storage. I'm running news-please and MySQL in Docker containers on my workstation. news-please is running in CLI mode. MySQL has been tested using a SQL client (Toad), and I can create databases, read from them, etc.

I updated the settings in my news-please config file to use MySQL (updated username and password). When I start the Docker container, news-please starts crawling and saving JSON and HTML files to the file system, but it is not writing to the MySQL database.

The documentation mentions an init-db.sql script that can be used to set up the database. This doesn't seem to be in the repo.

Configuration:
To use this module you have to enter the address, the used port and if needed your user credentials into the MySQL section of newscrawler.cfg. There is also a setup script init-db.sql for a convenient creation of the used tables.

In main.py I see reset MySQL but there is no mention of this in the documentation.

ImportError: No module named _thread

import newsplease
Traceback (most recent call last):
  File "", line 1, in <module>
  File "/home/hi-161/home/nes/ajay/local/lib/python2.7/site-packages/newsplease/__init__.py", line 6, in <module>
    from newsplease.single_crawler import SingleCrawler
  File "/home/hi-161/home/nes/ajay/local/lib/python2.7/site-packages/newsplease/single_crawler.py", line 26, in <module>
    from _thread import start_new_thread
ImportError: No module named _thread
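
The module `_thread` is the Python 3 name; on Python 2.7 the same low-level module is called `thread`. A common way to support both is a guarded import, sketched below. This is an illustration of the pattern, not the code in single_crawler.py.

```python
try:
    # Python 3 name of the low-level threading module
    from _thread import start_new_thread
except ImportError:
    # Python 2 fallback: the same function lives in the `thread` module
    from thread import start_new_thread

# Either branch leaves a callable bound to the same name.
print(callable(start_new_thread))
```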

fix wiki links

It looks like many were broken when the repo was transferred.

Error running commoncrawl.py

After running commoncrawl.py for about 15 minutes, it throws the following error:

DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): ads.civitasmedia.com
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): u.openx.net
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://u.openx.net:80 "GET /w/1.0/sc?cc=1&r=http%3A%2F%2Fads.civitasmedia.com%2Fw%2F1.0%2Fai%3Fcc%3D1%26auid%3D465268%26cs%3D517002e209b24%26cb%3D18879 HTTP/1.1" 302 0
DEBUG:urllib3.connectionpool:http://ads.civitasmedia.com:80 "GET /w/1.0/ai?mi=1bbd358a-0aa6-45e6-927b-7cc5cdbeab95&ma=1497001411&mr=1498211012&mn=1&mc=1&cc=1&auid=465268&cs=517002e209b24&cb=18879 HTTP/1.1" 200 43
DEBUG:urllib3.connectionpool:Starting new HTTP connection (1): 11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com
DEBUG:urllib3.connectionpool:http://11jo8z152kaa38lham19pzzv.wpengine.netdna-cdn.com:80 "GET /images/civitasreverse.png HTTP/1.1" 200 16957
INFO:__main__:article discard (sunburynews.com; None; Sunbury News)
INFO:__main__:statistics
INFO:__main__:pass = 0, discard = 160, total = 160
INFO:__main__:extraction from current WARC file started 10 minutes, 41 seconds ago; 4.012312 s/article
INFO:__main__:article discard (istoe.com.br; 2016-08-08 09:53:00; Olimp\xc3\xadada tem quebra de sete recordes mundiais)
INFO:__main__:article discard (brejo.com; 2013-12-22 00:00:00; FOTOS: Col\xc3\xa9gio da Luz realiza a 10\xc2\xaa edi\xc3\xa7\xc3\xa3o do Auto do Natal Luz)
Traceback (most recent call last):
  File "commoncrawl.py", line 271, in <module>
    common_crawl.run()
  File "commoncrawl.py", line 237, in run
    self.__process_warc_gz_file(local_path_name)
  File "commoncrawl.py", line 199, in __process_warc_gz_file
    article = NewsPlease.from_warc(record)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 34, in from_warc
    article = NewsPlease.from_html(html, url)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/__init__.py", line 68, in from_html
    item = extractor.extract(item)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/article_extractor.py", line 53, in extract
    article_candidates.append(extractor.extract(item))
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newsplease/pipeline/extractor/extractors/newspaper_extractor.py", line 30, in extract
    article.parse()
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/article.py", line 219, in parse
    meta_data = self.extractor.get_meta_data(self.clean_doc)
  File "/home/andreas/Envs/newsi/lib/python3.5/site-packages/newspaper/extractors.py", line 514, in get_meta_data
    ref[part] = value
TypeError: 'int' object does not support item assignment
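
Until the parsing error itself is fixed, one way to keep a long-running extraction alive is to skip records that fail instead of letting the exception end the run. The sketch below is not part of commoncrawl.py; `extract_or_skip` and `failing_extractor` are hypothetical names, and in the script the callable would be `NewsPlease.from_warc`, as seen in the traceback.

```python
import logging

logger = logging.getLogger(__name__)


def extract_or_skip(extract, record):
    """Call extract(record); return None instead of raising on failure."""
    try:
        return extract(record)
    except Exception:
        # Log the full traceback so the offending record can be inspected later.
        logger.exception("extraction failed, skipping record")
        return None


def failing_extractor(record):
    """Hypothetical stand-in that fails the same way as the traceback above."""
    raise TypeError("'int' object does not support item assignment")


# A failing record yields None; a working callable passes its result through.
print(extract_or_skip(failing_extractor, object()))
print(extract_or_skip(len, "abc"))
```

Catching `Exception` broadly is a deliberate trade-off here: it hides no errors from the log, but it does mean a systematic bug could silently discard many records, so the skip count is worth monitoring.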

assets

All resources stored here were created with the online tool draw.io.
If you want to change a file, visit www.draw.io and open the picture with their online tool.
The tool will recover the .xml version of the diagram and you can perform your edits.

Please be sure to include the .xml version when exporting the diagram into a picture. ;)

km4-article-extrator-class-diagram

mysql-er-diagram

news-please-class-diagram

news-please-flowchart

Improve ComparerDescription, ComparerAuthor, ComparerDate, ComparerTopimage

These comparers return the result from newspaper if there is one. Since newspaper works quite well, this is effective. However, if you implement further extractors, these comparers can be improved:

ComparerDate: There could be a check that the extracted result is a valid date. Additionally, extracted dates may be written in different ways but represent the same date. A good comparer would recognize that.

ComparerAuthor: A comparer that checks the similarity of extracted authors would be nice. This could be implemented similarly to ComparerTitle. Sometimes the extractor extracts wrong authors, sometimes up to five authors. A limit of four authors might be sensible.

ComparerDescription: A measure of similarity would be great. Additionally, an interaction with ComparerText would be useful. What happens if the result from ComparerDescription is None? You could take the first paragraph of the text extracted by ComparerText. You could also validate the result: is the result from ComparerDescription included in the main text or not? What happens if there is no description in the article? Should the result be None, or the first paragraph of the main text? What happens when the first paragraph is the description? Would this create unnecessary redundancy that would affect the further workflow?

ComparerTopimage: Sometimes there is no top image but a video. A method to deal with this case would be great.
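
The two checks suggested for ComparerDate (is the value a valid date, and do two differently written values refer to the same date) can be sketched as follows. This is only an illustration under assumptions: `parse_date`, `same_day`, and the list of formats are hypothetical and not part of the existing comparer.

```python
from datetime import datetime

# The accepted formats are an assumption; extend them with whatever the
# extractors actually emit.
KNOWN_FORMATS = ("%Y-%m-%d %H:%M:%S", "%Y-%m-%d", "%d.%m.%Y", "%d/%m/%Y")


def parse_date(text):
    """Return a datetime for the first matching format, or None if invalid."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt)
        except (ValueError, AttributeError):
            # ValueError: format mismatch; AttributeError: text is not a string
            continue
    return None


def same_day(first, second):
    """True if both values parse and fall on the same calendar day."""
    a, b = parse_date(first), parse_date(second)
    return a is not None and b is not None and a.date() == b.date()


print(parse_date("not a date"))
print(same_day("2016-08-08 09:53:00", "08.08.2016"))
```

Comparing on the calendar day rather than the full timestamp is a design choice: extractors often disagree on the time of day or drop it entirely.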

Couldn't complete the installation

When I try to install this library, I get the following error:

error: Microsoft Visual C++ 14.0 is required. Get it with "Microsoft Visual C++ Build Tools": http://landinghub.visualstudio.com/visual-cpp-build-tools

    ----------------------------------------
Command "c:\python36-32\python.exe -u -c "import setuptools, tokenize;__file__='C:\\Users\\Nishara\\AppData\\Local\\Temp\\pip-install-9axz4aui\\Twisted\\setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record C:\Users\Nishara\AppData\Local\Temp\pip-record-_xo3iyqt\install-record.txt --single-version-externally-managed --compile" failed with error code 1 in C:\Users\Nishara\AppData\Local\Temp\pip-install-9axz4aui\Twisted\

My python version is python 3.6

run news-please in a cluster

News-please has been very easy to set up and test. I've been getting excellent results during my testing, and now I'm considering putting it into production.

I'm trying to figure out the best way to run multiple instances in an AWS ECS cluster. If multiple News-please crawlers point to the same MySQL database, will this allow them to distribute the tasks across a cluster?

If not; do you have any thoughts on how to run news-please in a scalable cluster?

How to limit the time spend on the same website

Hi, first of all I want to thank you for this project.
What I need is to limit the time that the spider spends on the same website. For example, the last time I ran it, almost 1.5 GB was downloaded from the same website.
For the moment I'm using the default configuration and only specify the URLs of the sites that I want to crawl.

Thank you very much.

Can't run example

Hello, I am trying to work through the examples.

I installed it by simply running "pip install newsplease.zip".

After that, I run python downloadfromurl.py and get the following error:

Traceback (most recent call last):
  File "downloadfromurl.py", line 16, in <module>
    with open(basepath + article['filename'] + '.json', 'w') as outfile:
TypeError: 'NewsArticle' object is not subscriptable

I changed basepath and url.

What am I missing?
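
The error message says that the returned object does not support `[...]` lookups, which usually means its fields are exposed as attributes instead of dictionary keys. The sketch below uses a hypothetical stand-in class to show the difference; whether `NewsArticle` exposes `filename` as an attribute is an assumption based on the failing line.

```python
import json


class ArticleStandIn(object):
    """Hypothetical stand-in for an object that exposes attributes, not keys."""

    def __init__(self, filename, title):
        self.filename = filename
        self.title = title


article = ArticleStandIn("example-article", "Example title")

# article['filename'] would raise TypeError here; attribute access works.
output_name = article.filename + ".json"

# vars() returns the instance attributes as a plain dict for serialization.
payload = json.dumps(vars(article), sort_keys=True)

print(output_name)
print(payload)
```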

"Unhandled Error in Deferred"

Hi,
I am working with Python 3.6.3 and pip 9.0.1 on Ubuntu 16.04.3 LTS.
news-please does not work right after the installation.

The installation completed successfully.
After starting "news-please" in the CLI, an "Unhandled error in Deferred" message appears.
In Python programs "newsplease" can be imported, but not used.
The news-please call had to be made with sudo; otherwise a permission error appears.

You can see the stacktrace below.

Cheers,
Raphael

raphael@raphael-Latitude-E6330:~$ sudo news-please
[sudo] password for raphael:
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/f03a98d15778ac99eeb8c578aa8c224b since '--resume' was not passed to initial.py or this crawler was daemonized.
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
[newsplease.config:165|INFO] Loading config-file (/home/raphael/news-please-repo/config/config.cfg)
Unhandled error in Deferred:

[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/861e0b7ca3034017282d27dce656d520 since '--resume' was not passed to initial.py or this crawler was daemonized.
[main:253|INFO] Removed /home/raphael/news-please-repo/.resume_jobdir/5011d55eaa1b745eefb709134271e173 since '--resume' was not passed to initial.py or this crawler was daemonized.
Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.

Unable to use cli "news-please"

Traceback (most recent call last):
  File "/usr/local/bin/news-please", line 7, in <module>
    from newsplease.main import main
  File "/usr/local/lib/python2.7/site-packages/newsplease/__init__.py", line 13, in <module>
    from dotmap import DotMap
ImportError: No module named dotmap

AttributeError: 'module' object has no attribute 'request'

Hello!

After a fresh install I ran the example code from the readme file and it gave me the following error:

>>> from newsplease import NewsPlease
>>> article = NewsPlease.from_url('https://www.nytimes.com/2017/02/23/us/politics/cpac-stephen-bannon-reince-priebus.html?hp')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/__init__.py", line 79, in from_url
    articles = NewsPlease.from_urls([url])
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/__init__.py", line 98, in from_urls
    html = SimpleCrawler.fetch_url(url)
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/crawler/simple_crawler.py", line 16, in fetch_url
    return SimpleCrawler._fetch_url(url, False)
  File "/Users/gregor/anaconda/lib/python2.7/site-packages/newsplease/crawler/simple_crawler.py", line 27, in _fetch_url
    req = urllib.request.Request(url, None, headers)
AttributeError: 'module' object has no attribute 'request'

Python 2.7.13 (on macOS 10.12.5 Sierra)
urllib is installed

Unfortunately I had no time for debugging. Might come back to it soon.

Gregor
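
On Python 2.7 the `urllib` module has no `request` submodule; `Request` and `urlopen` are provided by `urllib2` there. A guarded import is one way to support both interpreters. The sketch below only builds a request object and does not send it; `build_request` and the default user agent string are hypothetical, not taken from simple_crawler.py.

```python
try:
    # Python 3: Request lives in urllib.request (as does urlopen)
    from urllib.request import Request
except ImportError:
    # Python 2: the same class is provided by urllib2
    from urllib2 import Request


def build_request(url, user_agent="Mozilla/5.0"):
    """Create a request object with a User-Agent header, without sending it."""
    return Request(url, None, {"User-Agent": user_agent})


req = build_request("https://example.com/")
print(req.get_full_url())
```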

Problem with using Elasticsearch

I have installed this docker image on my Mac, and it starts fine. Nothing changed configuration-wise.

When I run news-please, however, I get the following error:
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
[newsplease.config:163|INFO] Loading config-file (/Users/lukaskawerau/news-please-repo/config/config.cfg)
Unhandled error in Deferred:

Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.

Is there anything, any steps I'm missing to get news-please to run with ES?
Running it with basic json/html-export works fine.

Any help appreciated!

Error installing news-please

C:\Users\nithi>pip3 install news-please
Collecting news-please
Using cached news-please-1.2.35.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.8.0-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-3.0.1.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-6.0.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.3.2 (from news-please)
Using cached beautifulsoup4-4.6.0-py3-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting dotmap>=1.2.17 (from news-please)
Using cached dotmap-1.2.20.tar.gz
Collecting PyDispatcher>=2.0.5 (from news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting warcio>=1.3.3 (from news-please)
Using cached warcio-1.5.1-py2.py3-none-any.whl
Collecting ago>=0.0.9 (from news-please)
Using cached ago-0.0.92.tar.gz
Collecting six>=1.10.0 (from news-please)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting lxml>=3.3.5 (from news-please)
Using cached lxml-4.1.1-cp36-cp36m-win32.whl
Collecting awscli>=1.11.117 (from news-please)
Using cached awscli-1.14.16-py2.py3-none-any.whl
Collecting hurry.filesize>=0.9 (from news-please)
Using cached hurry.filesize-0.9.tar.gz
Collecting newspaper3k (from news-please)
Using cached newspaper3k-0.2.5.tar.gz
Collecting pywin32>=220 (from news-please)
Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)

C:\Users\nithi>pip install pypiwin32
Collecting pypiwin32
Downloading pypiwin32-220-cp36-none-win32.whl (8.3MB)
100% |████████████████████████████████| 8.3MB 69kB/s
Installing collected packages: pypiwin32
Successfully installed pypiwin32-220

C:\Users\nithi>pip3 install news-please
Collecting news-please
Using cached news-please-1.2.35.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.4.0-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.8.0-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-3.0.1.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-6.0.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.3.2 (from news-please)
Using cached beautifulsoup4-4.6.0-py3-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.1-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting dotmap>=1.2.17 (from news-please)
Using cached dotmap-1.2.20.tar.gz
Collecting PyDispatcher>=2.0.5 (from news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting warcio>=1.3.3 (from news-please)
Using cached warcio-1.5.1-py2.py3-none-any.whl
Collecting ago>=0.0.9 (from news-please)
Using cached ago-0.0.92.tar.gz
Collecting six>=1.10.0 (from news-please)
Using cached six-1.11.0-py2.py3-none-any.whl
Collecting lxml>=3.3.5 (from news-please)
Using cached lxml-4.1.1-cp36-cp36m-win32.whl
Collecting awscli>=1.11.117 (from news-please)
Using cached awscli-1.14.16-py2.py3-none-any.whl
Collecting hurry.filesize>=0.9 (from news-please)
Using cached hurry.filesize-0.9.tar.gz
Collecting newspaper3k (from news-please)
Using cached newspaper3k-0.2.5.tar.gz
Collecting pywin32>=220 (from news-please)
Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)
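
The log shows that `pypiwin32` installs from PyPI while `pywin32>=220` cannot be resolved, and that installing `pypiwin32` first does not help because the requirement is still named `pywin32`. One possible approach on the packaging side is to add the Windows-only dependency conditionally and under the name that resolves. The sketch below is an assumption about how that could look, not the project's actual setup.py; `platform_requirements` is a hypothetical helper and the base list is abbreviated.

```python
import sys


def platform_requirements(platform=sys.platform):
    """Return extra dependencies needed only on the given platform."""
    if platform.startswith("win"):
        # pypiwin32 is the package that installed successfully in the log above.
        return ["pypiwin32>=220"]
    return []


# Abbreviated base list, taken from the requirements printed by pip.
base = ["Scrapy>=1.1.0", "lxml>=3.3.5"]
install_requires = base + platform_requirements()

print(platform_requirements("win32"))
print(platform_requirements("linux"))
```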

Improve ComparerText

The ComparerText compares each text with every other text and calculates a similarity score. Depending on this score, one of the most similar texts is returned. There are several possible improvements:

What happens when there are two texts that are very similar and one text that is not similar to the others but is the best extraction? One of the similar texts would be returned. The problem is that there is no check for extractor biases. Say there is a tag in an HTML file that most of the extractors do not extract correctly. With just a similarity measure, this wrong result would be returned because several extractors produced it.

What happens when there are two similar but wrongly extracted texts and three not-so-similar texts, each of which is better than either of the two very similar ones? One of the two similar ones would be returned. This is why there should actually be a method that compares every partition with other partitions. The result would be more correct, but you lose speed.

The similarity score was created by a group of students. There is research on text comparison, and there may be better ways to check the similarity of texts.
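
The bias described above can be reproduced with a small sketch that picks the candidate with the highest mean pairwise similarity. It uses `difflib.SequenceMatcher` from the standard library as a stand-in for the project's own similarity score, so the numbers will differ from ComparerText; `pick_by_similarity` and the sample texts are hypothetical.

```python
from difflib import SequenceMatcher


def pick_by_similarity(texts):
    """Return the text with the highest mean similarity to all other texts.

    Assumes texts is a non-empty list of strings.
    """
    def mean_score(index):
        others = [t for i, t in enumerate(texts) if i != index]
        if not others:
            return 0.0
        total = sum(SequenceMatcher(None, texts[index], o).ratio() for o in others)
        return total / len(others)

    best = max(range(len(texts)), key=mean_score)
    return texts[best]


candidates = [
    "Cookie banner: accept all cookies to continue reading.",
    "Cookie banner: accept all cookies to continue reading!",
    "The city council approved the new budget on Monday evening.",
]

# Two extractors agree on the wrong text, so one of the banner texts wins
# even though the third candidate is the actual article.
print(pick_by_similarity(candidates))
```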

Inconsistent beautifulsoup4 dependency for Python 2.7

The Python 2.7 version of news-please requires beautifulsoup4>=4.5.1 and newspaper>=0.0.9.8.

newspaper >=0.0.9.8 requires beautifulsoup4==4.3.2, creating a beautifulsoup4 dependency issue when trying to install news-please with pip.

Is there a way around this issue? Python3 is not an option.

Running Windows7.

Cut down on what is published per article

Hi guys, love the tool. I was just looking for a way to cut down on the fields written to file and to change the field names. Is there a file where I can edit these settings?

Specifically, I want the output files to contain only:

'url': {'type': 'string', 'index': 'not_analyzed'},
'source': {'type': 'string', 'index': 'not_analyzed'},
'created_at#(renamed published_date)#': {'type': 'date', "format":"yyyy-MM-dd HH:mm:ss"},
'title': {'type': 'string'},
'text': {'type': 'string'},
'author': {'type': 'string'}
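
Independent of where such a setting might live in the configuration, the filtering itself can be done as a post-processing step on each article dictionary. The sketch below is one way to do that; `select_fields` and the sample article are hypothetical, and it assumes the intent is to write `created_at` under the name `published_date`.

```python
def select_fields(article, keep, rename=None):
    """Return a new dict with only the keys in keep, optionally renamed."""
    rename = rename or {}
    return {rename.get(key, key): article[key] for key in keep if key in article}


# Hypothetical article with one field ("language") that should be dropped.
article = {
    "url": "https://example.com/a",
    "source": "example.com",
    "created_at": "2017-01-01 12:00:00",
    "title": "Example",
    "text": "Body",
    "author": "Jane Doe",
    "language": "en",
}

keep = ["url", "source", "created_at", "title", "text", "author"]
slim = select_fields(article, keep, rename={"created_at": "published_date"})
print(sorted(slim))
```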

improve json export

  • config: where to store
  • config: format: pretty or compact
  • change file ending (currently .html.json)
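
The three points above could be covered by a single export helper with a target directory, a pretty/compact switch, and a configurable file ending. This is a sketch of the idea for Python 3, not the existing export code; `export_json` and its parameters are hypothetical names.

```python
import json
import os
import tempfile


def export_json(article, directory, name, pretty=True, extension=".json"):
    """Write article (a dict) to directory/name+extension and return the path."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + extension)
    with open(path, "w", encoding="utf-8") as handle:
        if pretty:
            # Indented, stable key order: easier to read and to diff
            json.dump(article, handle, indent=2, sort_keys=True, ensure_ascii=False)
        else:
            # No whitespace between tokens: smallest file size
            json.dump(article, handle, separators=(",", ":"), ensure_ascii=False)
    return path


# Usage: write a compact file into a temporary directory.
demo_dir = tempfile.mkdtemp()
compact_path = export_json({"title": "Example"}, demo_dir, "article", pretty=False)
print(compact_path)
```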

HTTP Error 505: HTTP Version not supported

Hi,
I tried to get started by running commoncrawl.py but encountered this error. I checked everything I know, but still no luck. Do you happen to know what this issue is about? The error log is attached.
errorlog.txt

Thank you for your time.

Improve ComparerLanguage

At the moment, the comparer just counts how often each language was detected. But there are several problems:

  • The comparer does not consider that different language detectors may use different abbreviations for the same language.

  • If there is no most frequently detected language, the comparer takes the language extracted by newspaper (because it is very accurate). But this may not be the best solution.
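
Both points can be illustrated with a small majority vote that first maps detector-specific abbreviations to one code and only falls back when there is no clear winner. This is a sketch under assumptions, not the current ComparerLanguage: the alias table, `normalize`, and `pick_language` are hypothetical.

```python
from collections import Counter

# The alias table is an assumption; extend it for the detectors actually in use.
ALIASES = {"eng": "en", "english": "en", "deu": "de", "ger": "de", "german": "de"}


def normalize(code):
    """Map a detector-specific abbreviation to a common lowercase code."""
    code = code.strip().lower()
    return ALIASES.get(code, code)


def pick_language(detected, fallback=None):
    """Majority vote over normalized codes; fallback on a tie or no votes."""
    counts = Counter(normalize(code) for code in detected if code)
    ranked = counts.most_common(2)
    if not ranked:
        return fallback
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        # No single most frequent language, e.g. one vote each
        return fallback
    return ranked[0][0]


print(pick_language(["en", "eng", "de"]))
print(pick_language(["en", "de"], fallback="de"))
```

Passing the fallback explicitly keeps the choice of tie-breaker (currently the newspaper result) visible to the caller, so it can be replaced without touching the vote itself.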

Error installing via pip under Windows

I experienced a problem with the installation via pip under Windows 10 (64-bit) and Python 3.5.

Collecting pywin32>=220 (from news-please)
  Could not find a version that satisfies the requirement pywin32>=220 (from news-please) (from versions: )
No matching distribution found for pywin32>=220 (from news-please)

Is there a quick solution to this problem?

News please stopping in AWS EC2

On an AWS EC2 t2.micro instance running Ubuntu, I have set up news-please to test whether it works properly. But whenever I start news-please, it fails with the error "Unhandled error in Deferred:".

I also tried starting it in DEBUG mode. The stack trace is attached below:

/usr/local/lib/python3.5/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '0', 'False', 'False']
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '1', 'False', 'False']
[newsplease.main:255|DEBUG] Calling Process: ['/usr/bin/python3', '/usr/local/lib/python3.5/dist-packages/newsplease/single_crawler.py', '/home/ubuntu/news-please-repo/config/config.cfg', '/home/ubuntu/news-please-repo/config/sitelist.hjson', '2', 'False', 'False']
/usr/local/lib/python3.5/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
/usr/local/lib/python3.5/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
/usr/local/lib/python3.5/dist-packages/requests/__init__.py:80: RequestsDependencyWarning: urllib3 (1.13.1) or chardet (2.3.0) doesn't match a supported version!
RequestsDependencyWarning)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[main:192|DEBUG] Using crawler RecursiveCrawler for https://www.dig-in.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/55ae9ff89e530b083b20633a558d116b since '--resume' was not passed to initial.py or this crawler was daemonized.
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[newsplease.config:165|INFO] Loading config-file (/home/ubuntu/news-please-repo/config/config.cfg)
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Crawler] default
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] url_input_file_name
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] working_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Files] local_data_directory
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [MySQL] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] host
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] ca_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_cert_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Elasticsearch] client_key_path
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_level
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_format
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_dateformat
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] log_encoding
[newsplease.config:167|DEBUG] Option not literal_eval-parsable (maybe string): [Scrapy] jobdirname
[main:88|DEBUG] Config initialized - Further initialisation.
[newsplease.config:267|DEBUG] Loading JSON-file (/home/ubuntu/news-please-repo/config/sitelist.hjson)
[main:192|DEBUG] Using crawler RecursiveCrawler for http://www.insurancejournal.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/8294b9c2cc3db1cadb6b4a98109c8590 since '--resume' was not passed to initial.py or this crawler was daemonized.
[main:192|DEBUG] Using crawler RecursiveCrawler for http://www.dnaindia.com/.
[main:253|INFO] Removed /home/ubuntu/news-please-repo/.resume_jobdir/6906ed0b1a6ca7bd359e919a4fd74596 since '--resume' was not passed to initial.py or this crawler was daemonized.
Unhandled error in Deferred:

Unhandled error in Deferred:

Unhandled error in Deferred:

[newsplease.main:270|INFO] Graceful stop called manually. Shutting down.

Please help.

Version Conflict on 3.5

Hi again,
I have tried running news-please via the CLI on Python 3.5 and Ubuntu 16.02.
A version conflict was raised. See stack trace below.

Cheers, Raphael

raphael@raphael-Latitude-E6330:~$ sudo news-please
Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 635, in _build_master
    ws.require(__requires__)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 943, in require
    needed = self.resolve(parse_requirements(requirements))
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 834, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (lxml 3.5.0 (/usr/lib/python3/dist-packages), Requirement.parse('lxml>=3.6.0'), {'newspaper3k'})

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/bin/news-please", line 5, in <module>
    from pkg_resources import load_entry_point
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2927, in <module>
    @_call_aside
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2913, in _call_aside
    f(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 2940, in _initialize_master_working_set
    working_set = WorkingSet._build_master()
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 637, in _build_master
    return cls._build_from_requirements(__requires__)
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 650, in _build_from_requirements
    dists = ws.resolve(reqs, Environment())
  File "/usr/lib/python3/dist-packages/pkg_resources/__init__.py", line 834, in resolve
    raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (lxml 3.5.0 (/usr/lib/python3/dist-packages), Requirement.parse('lxml>=3.6.0'), {'newspaper3k'})

add wiki doc for direct url crawl and extract

Hi Felix,

You only need to use the Download crawler and switch off the heuristics.
Attached you will find a corresponding JSON file.

Best regards,
Sören

# This is the actual config file; by default it is only filled with examples.
{
  # Every URL has to be in an array-object in "base_urls".
  # The same URL in combination with the same crawler may only appear once in this array.
  "base_urls" : [
    {
      "crawler": "Download",
      "url": [
        # Cubs win Championship ~03.11.2016
        "http://www.dailymail.co.uk/news/article-3899956/Chicago-Cubs-win-World-Series-epic-Game-7-showdown-Cleveland.html",
        "http://www.mirror.co.uk/sport/other-sports/american-sports/chicago-cubs-win-world-series-9185077",
        "https://www.theguardian.com/sport/2016/nov/03/world-series-game-7-chicago-cubs-cleveland-indians-mlb",
        "http://www.telegraph.co.uk/baseball/2016/11/03/chicago-cubs-break-108-year-curse-of-the-billy-goat-winning-worl/",
        "https://www.thesun.co.uk/sport/othersports/2106710/chicago-cubs-win-world-series-hillary-clinton-bill-murray-and-barack-obama-lead-celebrations-as-cubs-end-108-year-curse/",
        "http://www.bbc.com/sport/baseball/37857919",
        "http://www.thetimes.co.uk/article/chicago-cubs-end-108-year-wait-for-world-series-win-g09t0kgfm",
        "http://www.independent.co.uk/sport/us-sport/major-league-baseball/world-series-chicago-cubs-cleveland-indians-108-year-title-drought-a7394706.html",
        "http://www.independent.co.uk/sport/us-sport/major-league-baseball/chicago-cubs-fans-celebrate-world-series-title-a7394736.html",
        "http://www.standard.co.uk/sport/other-sports/chicago-cubs-win-world-series-to-end-108year-curse-and-earn-invite-from-barack-obama-a3386411.html",
        "http://www.nytimes.com/2016/11/03/sports/baseball/chicago-cubs-beat-cleveland-indians-world-series-game-7.html?_r=0",
        "https://www.washingtonpost.com/sports/believe-it-chicago-cubs-win-classic-game-7-to-win-first-world-series-since-1908/2016/11/03/99cfc9c2-a0b3-11e6-a44d-cc2898cfab06_story.html",
        "https://www.washingtonpost.com/sports/nationals/you-knew-it-couldnt-come-easy-but-the-cubs-are-world-series-champions/2016/11/03/a4487ade-a0b3-11e6-a44d-cc2898cfab06_story.html",
        "http://www.usatoday.com/story/sports/ftw/2016/11/03/sports-world-reacts-to-the-chicago-cubs-winning-their-first-world-series-since-1908/93225730/",
        "http://www.wsj.com/articles/chicago-cubs-win-the-world-series-ending-108-year-drought-1478149589",
        "http://nypost.com/2016/11/03/cubs-end-drought-in-chaotic-epic-world-series-finale/",


        # FBI clears Clinton ~06.11.2016
        "http://www.dailymail.co.uk/wires/reuters/article-3910804/Trump-Clinton-focus-crucial-states-campaigns-final-days.html",
        "http://www.mirror.co.uk/news/world-news/hillary-clinton-cleared-fbi-over-9210739",
        "https://www.theguardian.com/us-news/2016/nov/06/fbi-director-hillary-clinton-email-investigation-criminal-james-comey",
        "http://www.bbc.com/news/election-us-2016-37892348",
        "https://www.thesun.co.uk/news/2130219/trumps-fury-after-fbi-says-hillary-clinton-has-committed-no-crime-in-email-scandal/",
        "http://www.thetimes.co.uk/article/clinton-off-the-hook-as-fbi-drops-investigation-into-emails-m9t6pnr0s",
        "http://www.express.co.uk/news/world/729545/hillary-clinton-chelsea-clinton-wedding-funds-wikileaks-rudy-giuliani",
        "http://www.telegraph.co.uk/news/2016/11/06/us-election-hillary-clinton-up-in-polls-as-hispanic-surge-threat/",
        "http://nypost.com/2016/11/06/fbi-stands-by-decision-not-to-charge-clinton-after-review-of-additional-emails/",
        "http://www.wsj.com/articles/fbis-comey-says-new-emails-dont-change-conclusions-about-hillary-clinton-1478464650",
        "http://www.usatoday.com/story/news/politics/elections/2016/2016/11/06/fbi-not-recommending-charges-over-new-clinton-emails/93395808/",
        "https://www.washingtonpost.com/blogs/post-partisan/wp/2016/11/06/comey-to-country-the-jury-will-disregard/?utm_term=.914ce12b2617",
        "http://www.nytimes.com/2016/11/06/us/politics/presidential-election.html",

        # China: rescue boy from well ~08.11.2016
        "http://www.bbc.com/news/world-asia-china-37906226",
        "http://www.bbc.com/news/world-asia-china-37946716",
        "http://www.dailymail.co.uk/news/article-3916560/Dramatic-footage-shows-rescuers-using-eighty-diggers-save-boy-fell-130ft-deep-picking-cabbages-Chinese-farm.html",
        "http://www.dailymail.co.uk/news/article-3923808/Mystery-Chinese-boy-fell-deep-massive-rescue-operation-involving-80-diggers-chute-empty.html",
        "http://www.thetimes.co.uk/article/millions-watch-as-rescue-effort-fails-to-save-boy-52c6dlpml",

        # Toblerone redesign outrage ~08.11.2016
        "http://www.standard.co.uk/news/uk/toblerone-bar-shape-change-sparks-anger-among-fans-a3389711.html",
        "http://www.independent.co.uk/news/uk/home-news/toblerone-new-shape-outrage-chocolate-scandal-a7404011.html",
        "https://www.thesun.co.uk/news/2138318/toblerone-fans-outraged-after-gap-between-triangles-is-increased-to-reduce-the-amount-of-chocolate-in-bars/",
        "http://www.thetimes.co.uk/article/toblerone-redesign-runs-into-a-mountain-of-trouble-82h785rfr",
        "http://www.telegraph.co.uk/news/2016/11/08/toblerone-faces-mountain-of-criticism-over-stupid-change-to-its/",
        "https://www.theguardian.com/business/2016/nov/08/toblerone-gets-more-gappy-but-its-fans-are-not-happy",
        "http://www.mirror.co.uk/news/uk-news/is-diet-version-outrage-toblerone-9217538",
        "http://www.dailymail.co.uk/news/article-3915960/Toblerone-increase-gaps-bar-s-iconic-peaks-make-lighter.html",
        "http://www.bbc.com/news/uk-37904703",
        "http://www.mirror.co.uk/news/weird-news/best-way-use-controversial-new-9233101",
        "http://www.express.co.uk/life-style/food/730108/Toblerone-shrinks-chocolate-bar-denies-Brexit-link",
        "http://nypost.com/2016/11/08/people-are-pissed-over-toblerones-new-candy-size/",
        "http://blogs.wsj.com/moneybeat/2016/11/08/mondelezs-toblerone-moves-mountain-to-hide-price-increase/",
        "http://www.usatoday.com/story/money/nation-now/2016/11/08/while-us-talks-election-uk-outraged-over-toblerone-chocolate/93465240/",
        "https://www.washingtonpost.com/news/worldviews/wp/2016/11/08/brits-blame-strange-new-toblerone-shape-on-brexit/",
        "http://www.nytimes.com/2016/11/09/world/europe/toblerone-triangle-change-uk.html",

        # Trump wins election
        # Anti-Trump protests ~09.11.2016
        "http://www.independent.co.uk/news/world/americas/us-elections/us-election-donald-trump-wins-protests-los-angeles-california-oregon-a7407521.html",
        "http://www.thetimes.co.uk/article/thousands-protest-against-election-result-in-us-cities-5v6ncl6pg",
        "http://www.telegraph.co.uk/news/2016/11/10/demonstrations-erupt-across-the-us-as-country-begins-to-imagine/",
        "http://www.express.co.uk/news/world/730363/protests-Donald-Trump-violence-US-election-Hillary-Clinton",
        "http://www.mirror.co.uk/news/world-news/trump-win-sparks-riots-across-9225317",
        "http://www.bbc.com/news/election-us-2016-37946231",
        "http://www.dailymail.co.uk/wires/ap/article-3920168/Trump-victory-sets-protests-California-Oregon.html",
        "https://www.theguardian.com/us-news/2016/nov/09/anti-donald-trump-protests-new-york-chicago-san-francisco",
        "http://www.nytimes.com/2016/11/10/us/trump-election-protest-berkeley-oakland.html",
        "https://www.washingtonpost.com/politics/trumps-white-house-win-promises-to-reshape-us-political-landscape/2016/11/09/62baa5e4-a66a-11e6-ba59-a7d93165c6d4_story.html",
        "http://www.usatoday.com/story/news/politics/2016/11/10/hundreds-protest-trump-downtown-milwaukee/93617960/",
        "http://www.wsj.com/articles/thousands-protest-outside-trump-tower-1478742884",
        "http://nypost.com/2016/11/09/protests-erupt-in-california-after-trump-victory/",
        "http://nypost.com/2016/11/11/anti-trump-protests-continue-in-wake-of-election/",

        # Lady Gaga protests ~09.11.2016
        "http://www.standard.co.uk/showbiz/celebrity-news/lady-gaga-stages-protest-outside-trump-towers-after-donald-trump-beats-hillary-clinton-a3391526.html",
        "http://www.mirror.co.uk/3am/celebrity-news/lady-gaga-protests-outside-trump-9228523",
        "http://www.independent.co.uk/news/people/president-donald-trump-lady-gaga-protest-tower-new-york-a7407081.html",
        "http://www.telegraph.co.uk/news/2016/11/09/lady-gaga-protests-outside-trump-tower/",
        "http://www.dailymail.co.uk/news/article-3918926/Hollywood-starts-panic-results-aren-t-going-Clinton-s-way.html",

        # Croydon tram accident
        "https://www.theguardian.com/uk-news/2016/nov/09/croydon-tram-crash-kills-at-least-seven-and-injures-more-than-50",
        "http://www.standard.co.uk/news/transport/croydon-tram-derailment-people-trapped-after-tram-overturns-in-at-sandilands-a3390796.html",
        "http://www.bbc.com/news/uk-england-london-37919658",
        "http://www.express.co.uk/news/uk/730639/Croydon-tram-crash-carnage-survivor-derailment-seven-dead",
        "http://www.mirror.co.uk/news/uk-news/huge-rescue-operation-sandilands-station-9226276",
        "http://www.dailymail.co.uk/wires/pa/article-3919284/Five-trapped-40-injured-tram-overturns-tunnel.html",
        "http://www.telegraph.co.uk/news/2016/11/10/croydon-tram-crash-police-check-drivers-mobile-phone-records/",
        "https://www.thesun.co.uk/news/2150294/croydon-tram-crash-derailment-cause/",
        "http://www.thetimes.co.uk/article/at-least-four-dead-and-dozens-injured-as-tram-derails-vqpsbrjb3",
        "http://www.independent.co.uk/news/uk/home-news/five-trapped-40-injured-after-tram-overturns-south-london-croydon-a7406496.html",
        "http://nypost.com/2016/11/09/several-dead-and-dozens-injured-after-tram-overturns-in-london/",
        "http://www.usatoday.com/story/news/2016/11/09/least-7-killed-tram-accident-south-london/93549248/",
        "http://www.nytimes.com/2016/11/10/world/europe/tram-derails-croydon-london.html",

        # Croydon tram accident follow-up
        "http://www.standard.co.uk/news/london/croydon-tram-crash-police-identify-all-seven-victims-killed-in-derailment-tragedy-a3394126.html",
        "http://www.independent.co.uk/news/uk/home-news/croydon-tram-crash-victims-named-last-london-derail-tributes-a7414006.html",
        "https://www.thesun.co.uk/news/2172708/croydon-tram-crash-victims-named/",
        "http://www.dailymail.co.uk/news/article-3929748/Croydon-tram-crash-carriages-carried-away-lorry-police-probe-claims-derailed-just-days-seven-people-died-tragedy.html",
        "http://www.thetimes.co.uk/article/young-father-killed-in-tram-crash-x3zh5dp0v",

        # Shooting near protests ~10.11.2016
        "https://www.thesun.co.uk/news/2154878/shot-seattle-protests-donald-trump-election/",
        "http://www.express.co.uk/news/world/730721/Five-gunned-Seattle-shot-anti-Donald-Trump-US-President-Washington-victory-Republican",
        "http://www.dailymail.co.uk/news/article-3922446/Report-shooting-multiple-victims-near-Trump-protest-Seattle-PD.html",
        "http://www.wsj.com/articles/five-people-shot-in-seattle-unclear-if-connected-to-trump-protest-1478749638",

        # Trump meets Obama in the white house
        "http://www.bbc.com/news/election-us-2016-37932231",
        "http://www.dailymail.co.uk/news/article-3922932/Transition-Obama-Trump-meet-White-House.html",
        "http://www.standard.co.uk/news/world/barack-obama-describes-first-meeting-with-donald-trump-at-white-house-as-excellent-a3392866.html",
        "https://www.theguardian.com/us-news/live/2016/nov/10/donald-trump-barack-obama-white-house-us-election-live-updates",
        "http://www.independent.co.uk/news/world/americas/donald-trump-meets-barack-obama-body-language-president-president-elect-a7412186.html",
        "http://www.mirror.co.uk/news/world-news/donald-trump-barack-obama-hold-9234917",
        "http://www.thetimes.co.uk/article/rbtrump-3dc5hngts",
        "http://www.thetimes.co.uk/article/two-bitter-rivals-meet-at-the-white-house-br30bmkhn",
        "http://www.express.co.uk/news/world/730940/Donald-Trump-Barack-Obama-White-House-Washington-US-election-2016",
        "http://www.telegraph.co.uk/news/2016/11/10/donald-trump-and-barack-obamas-meeting-was-awkward-they-looked-l/",
        "https://www.thesun.co.uk/news/2159514/donald-trump-arrives-in-washington-ahead-of-power-transition-talks-with-president-barack-obama/",
        "https://www.washingtonpost.com/news/post-politics/wp/2016/11/10/obama-to-welcome-trump-to-white-house-for-first-meeting-since-election/",
        "http://www.usatoday.com/story/news/politics/elections/2016/11/10/obama-trump-white-house-transition/93581810/",
        "http://www.wsj.com/articles/trump-obama-set-to-begin-transition-1478787730",

        # German consulate attack ~10.11.2016
        "https://www.theguardian.com/world/2016/nov/10/taliban-attack-german-consulate-mazar-i-sharif-afghanistan-nato-airstrikes-kunduz",
        "http://www.bbc.com/news/world-asia-37944115",
        "http://www.independent.co.uk/news/world/middle-east/german-consulate-afghanistan-attacked-bomb-suicide-taliban-revenge-mazar-i-sharif-kunduz-attack-two-a7410746.html",
        "http://www.thetimes.co.uk/article/two-killed-in-bomb-attack-on-consulate-mttnh9pt9",
        "https://www.thesun.co.uk/news/2162467/taliban-suicide-bomber-truck-german-consulate-afghanistan-killing-two/amp/",
        "http://www.telegraph.co.uk/news/2016/11/10/taliban-attack-german-consulate-in-northern-afghan-city-of-mazar/",
        "http://www.express.co.uk/news/world/731052/German-consulate-explosion-gunfire-Afghanistan",
        "http://www.nytimes.com/2016/11/11/world/asia/taliban-strike-german-consulate-in-afghan-city-of-mazar-i-sharif.html?mtrref=query.nytimes.com&gwh=792F9F9ECEB17B00C71C4F8444293AD8&gwt=pay",
        "http://www.wsj.com/articles/german-consulate-in-afghanistan-attacked-1478817411",

        # Clinton blames FBI director Comey ~12.11.2016
        "http://www.independent.co.uk/news/people/hillary-clinton-blames-fbi-director-james-comeys-decision-to-to-reopen-email-probe-for-defeat-to-a7414021.html",
        "http://www.thetimes.co.uk/article/clinton-accuses-fbi-chief-of-costing-her-the-election-7czltxsxm",
        "http://www.bbc.com/news/election-us-2016-37963965",
        "https://www.thesun.co.uk/news/2168742/hillary-clinton-aide-blames-us-presidential-loss-to-donald-trump-on-fbi-chief-because-he-cleared-her-of-wrongdoing-over-weiner-emails/",
        "http://www.telegraph.co.uk/news/2016/11/12/hillary-clinton-blames-election-loss-on-fbis-james-comey-in-call/",
        "http://www.express.co.uk/news/world/731721/I-BLAME-COMEY-Bitter-Clinton-blames-FBI-chief-James-Comey-election-defeat-Donald-Trump",
        "http://www.mirror.co.uk/news/world-news/hillary-clinton-blames-fbi-director-9250867",
        "http://www.dailymail.co.uk/news/article-3930928/Hillary-Clinton-blames-FBI-Director-James-Comey-election-defeat.html",
        "https://www.theguardian.com/us-news/2016/nov/12/hillary-clinton-james-comey-letters-emails-election-defeat",
        "http://nypost.com/2016/11/12/clinton-blames-comeys-email-probe-for-her-defeat/",
        "http://www.wsj.com/articles/hillary-clinton-attributes-fbi-letters-as-factor-in-election-loss-1478994890",
        "https://www.washingtonpost.com/news/post-politics/wp/2016/11/12/hillary-clinton-blames-one-comey-letter-for-stopping-momentum-and-the-other-for-turning-out-trump-voters/",
        "http://www.nytimes.com/2016/11/13/us/politics/hillary-clinton-james-comey.html",

        # F1 Grand Prix Brazil ~13.11.2016
        # Hamilton wins
        "http://www.bbc.com/sport/formula1/37953887",
        "http://www.telegraph.co.uk/formula-1/2016/11/13/brazilian-grand-prix-live/",
        "http://www.thetimes.co.uk/article/champion-reigns-in-drenched-brazil-to-keep-title-hopes-alive-t922kw5hj",
        "https://www.thesun.co.uk/sport/2177105/lewis-hamilton-wins-the-brazilian-grand-prix-after-two-red-flags/",
        "https://www.theguardian.com/sport/live/2016/nov/13/f1-brazilian-grand-prix-live",

        # Verstappen avoids crash
        "https://www.theguardian.com/sport/blog/2016/nov/14/max-verstappen-brazilian-grand-prix-felipe-massa",
        "http://www.telegraph.co.uk/formula-1/2016/11/13/max-verstappen-even-stuns-his-dad-by-storming-home-into-third-pl/",
        "http://www.express.co.uk/sport/f1-autosport/731858/Max-Verstappen-avoids-crash-Kimi-Raikkonen-Brazilian-Grand-Prix-wet",
        "http://www.mirror.co.uk/sport/formula-1/red-bull-boss-christian-horner-9254708",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932890/Max-Verstappen-amazes-Red-Bull-principal-Christian-Horner-performance-Brazil-witnessed-special.html",
        "http://www.dailymail.co.uk/sport/sportsnews/article-3934424/Formula-One-star-Max-Verstappen-shows-nerves-steel-avoid-accident.html",

        # Räikkönen / Massa crash
        "http://www.mirror.co.uk/sport/formula-1/brazilian-f1-grand-prix-riddled-9253267",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932386/F1-legend-Felipe-Massa-makes-emotional-farewell-crashing-Brazil-Grand-Prix-Interlagos.html",
        "https://www.thesun.co.uk/sport/2177804/felipe-massa-retires-f1-legend-makes-a-very-emotional-farewell-after-crashing-in-his-last-home-race-in-brazil/",
        "http://www.dailymail.co.uk/sport/formulaone/article-3932252/Brazilian-Grand-Prix-thrown-chaos-Kimi-Raikkonen-accident-brings-red-flag-Sebastian-Vettel-fumes-stupid-conditions-mad.html",
        "http://www.standard.co.uk/sport/brazilian-grand-prix-redflagged-after-dramatic-kimi-raikkonen-crash-a3394411.html",
        "http://www.usatoday.com/story/sports/motor/formula1/2016/11/13/brazils-massa-crashes-but-gets-warm-farewell-at-home-gp/93771246/",
        "https://www.washingtonpost.com/sports/auto-racing/brazils-massa-crashes-but-gets-warm-farewell-at-home-gp/2016/11/13/007a15d4-a9e6-11e6-8f19-21a1c65d2043_story.html"

      ],

      "overwrite_heuristics": {
        "meta_contains_article_keyword": true,
        "og_type": false,
        "linked_headlines": false,
        "self_linked_headlines": false
      },
    }
  ]
}

ValueError: bad marshal data (unknown type code)

On Ubuntu 16.10, I got the above error.

File "/home/yuyuan/anaconda3/lib/python3.5/distutils/core.py", line 148, in setup
dist.run_commands()
File "/home/yuyuan/anaconda3/lib/python3.5/distutils/dist.py", line 955, in run_commands
self.run_command(cmd)
File "/home/yuyuan/anaconda3/lib/python3.5/distutils/dist.py", line 974, in run_command
cmd_obj.run()
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 209, in run
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 245, in zip_safe
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 355, in analyze_egg
File "/home/yuyuan/anaconda3/lib/python3.5/site-packages/setuptools-27.2.0-py3.5.egg/setuptools/command/bdist_egg.py", line 392, in scan_module
ValueError: bad marshal data (unknown type code)

RSS Crawler issues

I am running a fresh install of Python 3.5.4 on 64-bit Windows 8.1 with a fresh install of news-please. The example CLI runs without a problem, but when I modify the config file, the RSS crawler throws an error saying that no crawler was found. See image below.
image

(adding) Pipeline / Filter - keyword(s) Filter

Hey there,

Almost all news sites structure their content thematically, so broad thematic crawling is possible, and articles can be filtered indirectly afterwards using Elasticsearch (?) or a database. Still, is it planned to add a keyword pipeline filter in the future, so that articles not matching the keywords can be dropped on the fly?

Or did I miss something? I have crawled through the docs but can't find anything in that regard.

Best wishes
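
For illustration only, the keyword check such a filter would need could look like the sketch below. It is not an existing news-please feature: `contains_keyword`, `keep_article`, and the item field names `title` and `text` are hypothetical. Inside a Scrapy item pipeline, `process_item` could raise `scrapy.exceptions.DropItem` whenever `keep_article` returns False.

```python
import re


def contains_keyword(text, keywords):
    """Return True if any keyword occurs in text as a whole word (case-insensitive)."""
    if not text or not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def keep_article(item, keywords, fields=("title", "text")):
    """Decide whether a dict-like item mentions at least one keyword.

    The field names are assumptions; adapt them to the item actually used.
    """
    combined = " ".join(str(item.get(field) or "") for field in fields)
    return contains_keyword(combined, keywords)
```

Matching on whole words avoids keeping an article about "Brexiteers" when the keyword is "brexit"; drop the `\b` anchors if substring matches are wanted.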

Merge articles spread on multiple pages

Example: http://www.zeit.de/2016/18/ttip-barack-obama-hannover-usa-widerstand Under the given URL only the first part of the article is shown. A (human) reader can either click on a link that points to the second page or can click on "Auf einer Seite lesen" to read all on one page.

What will be the output of the current workflow? Ideally of course multiple pages should be identified and crawled as a single article. However, as this requires actual processing of the article, I expect the system to crawl this article as two articles?
If so, is there any way to easily identify (e.g., during the actual article extraction performed by the km4 team) that two (or more) articles actually belong to only one?

Answer:

It depends on the crawler:

The sitemap and RSS crawlers only find pages that are listed in the corresponding files. Thus, they only find the listed pages, which might be the first page, all pages, the entire article, or a combination.

The recursive crawlers, on the other hand, will find all pages as well as the entire article and, if the heuristics work for those, will save all of them.

For the latter, one possible way to identify whether articles belong together is to search for common text parts, since every page should be part of the entire article.

For both, it would be possible to extract URLs with keywords like "continue reading" or "page x" etc.
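
The common-text idea can be sketched roughly as follows. This is a minimal illustration and not part of news-please: the function names, the 40-character minimum paragraph length, and the 0.8 threshold are assumptions chosen for the example. A page is treated as part of the full article if most of its paragraphs also appear in the single-page version.

```python
def paragraphs(text, min_length=40):
    """Split text into a set of stripped paragraphs, ignoring very short lines."""
    return {p.strip() for p in text.split("\n") if len(p.strip()) >= min_length}


def overlap_ratio(part_text, full_text):
    """Fraction of the part's paragraphs that also occur in the full text."""
    part = paragraphs(part_text)
    if not part:
        return 0.0
    return len(part & paragraphs(full_text)) / len(part)


def belongs_to(part_text, full_text, threshold=0.8):
    """Guess whether part_text is one page of the article in full_text."""
    return overlap_ratio(part_text, full_text) >= threshold
```

Exact paragraph matching is deliberately strict; pages whose markup differs slightly between the paginated and single-page views would need fuzzier matching.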

Download article does not complete

When I try to download an article, my query

newsplease.NewsPlease.download_article('http://www.thehindu.com/todays-paper/tp-national/tp-otherstates/Elephant-destroys-three-houses-in-Meghalaya-village/article17026523.ece')

neither completes nor times out on Python 3.5.2. On Python 2.7.3, however, it returns a result but also raises the following TypeError:

Traceback (most recent call last):
  File "/Users/mdmadhusudan/anaconda/envs/earthengine/lib/python2.7/site-packages/twisted/internet/defer.py", line 653, in _runCallbacks
    current.result = callback(current.result, *args, **kw)
  File "/Users/mdmadhusudan/anaconda/envs/earthengine/lib/python2.7/site-packages/newsplease/pipeline/pipelines.py", line 365, in process_item
    json.dump(ExtractedInformationStorage.extract_relevant_info(item), file_)
TypeError: unbound method extract_relevant_info() must be called with ExtractedInformationStorage instance as first argument (got NewscrawlerItem instance instead)

I am on macOS Sierra. Any suggestions on how to fix either? Thanks.

Timeout URL retrieval

Hi,

How can I properly set a timeout for a URL retrieval? For example, retrieving the URL below never finishes on my machine (I'm on Windows, in case that matters):

url = 'http://www.nasdaq.com/article/forex-eurusd-keeps-pushing-higher-cm232206?utm_source=feedburner&utm_medium=feed&utm_campaign=Feed%3A+nasdaq%2Fcategories+%28Articles+by+Category%29'
out = NewsPlease.from_url(url)

The problem is more general. I would like to say: try to retrieve this URL for at most N seconds, otherwise give up.

I tried the options outlined here without success: https://stackoverflow.com/questions/492519/timeout-on-a-function-call. The function just keeps running.

I couldn't find anything in the wiki or in the config file. Maybe I overlooked it. I hope somebody can help.

Many thanks,

Sam
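
One generic way to bound the wait, independent of Unix signals and therefore usable on Windows, is to run the call in a daemon thread and stop waiting after N seconds. The sketch below is an assumption-laden illustration, not a news-please feature: `call_with_timeout` is a hypothetical helper, and it abandons the worker rather than killing it, so the download may keep running in the background until the process exits.

```python
import threading


def call_with_timeout(func, args=(), timeout=10.0, default=None):
    """Run func(*args) in a daemon thread and wait at most `timeout` seconds.

    Returns func's result, re-raises its exception, or returns `default`
    if the call has not finished in time.
    """
    outcome = {}

    def worker():
        try:
            outcome["value"] = func(*args)
        except Exception as exc:  # hand the error back to the caller
            outcome["error"] = exc

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()
    thread.join(timeout)
    if thread.is_alive():
        return default  # timed out; the worker is abandoned, not stopped
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Hypothetical usage with the call from the question:
# out = call_with_timeout(NewsPlease.from_url, (url,), timeout=30)
```

If the hanging call must actually be terminated, running it in a separate process (for example with `multiprocessing`) and terminating that process after the timeout is the heavier but stricter alternative.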

Error during installation of news-please

Ians-MacBook-Air:~ ianmackerracher$ sudo pip install news-please
Password:
The directory '/Users/ianmackerracher/Library/Caches/pip/http' or its parent directory is not owned by the current user and the cache has been disabled. Please check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
The directory '/Users/ianmackerracher/Library/Caches/pip' or its parent directory is not owned by the current user and caching wheels has been disabled. check the permissions and owner of that directory. If executing pip with sudo, you may want sudo's -H flag.
Collecting news-please
Downloading news-please-1.0.25.tar.gz (46kB)
100% |████████████████████████████████| 51kB 1.0MB/s
Collecting Scrapy>=1.1.0 (from news-please)
Downloading Scrapy-1.3.2-py2.py3-none-any.whl (239kB)
100% |████████████████████████████████| 245kB 650kB/s
Collecting PyMySQL>=0.7.9 (from news-please)
Downloading PyMySQL-0.7.10-py2.py3-none-any.whl (78kB)
100% |████████████████████████████████| 81kB 1.9MB/s
Collecting hjson>=1.5.8 (from news-please)
Downloading hjson-2.0.2.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Downloading elasticsearch-5.2.0-py2.py3-none-any.whl (57kB)
100% |████████████████████████████████| 61kB 1.6MB/s
Collecting beautifulsoup4>=4.5.1 (from news-please)
Downloading beautifulsoup4-4.5.3-py2-none-any.whl (85kB)
100% |████████████████████████████████| 92kB 2.5MB/s
Collecting readability-lxml>=0.6.2 (from news-please)
Downloading readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Downloading langdetect-1.0.7.zip (998kB)
100% |████████████████████████████████| 1.0MB 536kB/s
Collecting python-dateutil>=2.4.0 (from news-please)
Downloading python_dateutil-2.6.0-py2.py3-none-any.whl (194kB)
100% |████████████████████████████████| 194kB 894kB/s
Collecting plac>=0.9.6 (from news-please)
Downloading plac-0.9.6-py2.py3-none-any.whl
Collecting newspaper (from news-please)
Downloading newspaper-0.0.9.8.tar.gz (248kB)
100% |████████████████████████████████| 256kB 1.1MB/s
Requirement already satisfied: lxml in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: PyDispatcher>=2.0.5 in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: Twisted>=13.1.0 in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: pyOpenSSL in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Scrapy>=1.1.0->news-please)
Requirement already satisfied: queuelib in /Library/Python/2.7/site-packages (from Scrapy>=1.1.0->news-please)
Collecting w3lib>=1.15.0 (from Scrapy>=1.1.0->news-please)
Downloading w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from Scrapy>=1.1.0->news-please)
Downloading cssselect-1.0.1-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy>=1.1.0->news-please)
Downloading parsel-1.1.0-py2.py3-none-any.whl
Collecting service-identity (from Scrapy>=1.1.0->news-please)
Downloading service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy>=1.1.0->news-please)
Downloading six-1.10.0-py2.py3-none-any.whl
Collecting urllib3<2.0,>=1.8 (from elasticsearch>=2.4->news-please)
Downloading urllib3-1.20-py2.py3-none-any.whl (111kB)
100% |████████████████████████████████| 112kB 1.5MB/s
Collecting chardet (from readability-lxml>=0.6.2->news-please)
Downloading chardet-2.3.0-py2.py3-none-any.whl (180kB)
100% |████████████████████████████████| 184kB 1.7MB/s
Collecting Pillow==2.5.1 (from newspaper->news-please)
Downloading Pillow-2.5.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl (3.0MB)
100% |████████████████████████████████| 3.0MB 336kB/s
Collecting PyYAML==3.11 (from newspaper->news-please)
Downloading PyYAML-3.11.zip (371kB)
100% |████████████████████████████████| 378kB 1.4MB/s
Collecting nltk==2.0.5 (from newspaper->news-please)
Downloading nltk-2.0.5.tar.gz (954kB)
100% |████████████████████████████████| 962kB 821kB/s
Collecting requests==2.3.0 (from newspaper->news-please)
Downloading requests-2.3.0-py2.py3-none-any.whl (452kB)
100% |████████████████████████████████| 460kB 911kB/s
Collecting jieba==0.35 (from newspaper->news-please)
Downloading jieba-0.35.zip (7.4MB)
100% |████████████████████████████████| 7.4MB 137kB/s
Collecting feedparser==5.1.3 (from newspaper->news-please)
Downloading feedparser-5.1.3.zip (1.2MB)
100% |████████████████████████████████| 1.2MB 933kB/s
Collecting tldextract==1.5.1 (from newspaper->news-please)
Downloading tldextract-1.5.1.tar.gz (57kB)
100% |████████████████████████████████| 61kB 1.6MB/s
Collecting feedfinder2==0.0.1 (from newspaper->news-please)
Downloading feedfinder2-0.0.1.tar.gz
Requirement already satisfied: zope.interface>=3.6.0 in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Requirement already satisfied: constantly>=15.1 in /Library/Python/2.7/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Requirement already satisfied: incremental>=16.10.1 in /Library/Python/2.7/site-packages (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Collecting pyasn1 (from service-identity->Scrapy>=1.1.0->news-please)
Downloading pyasn1-0.2.2-py2.py3-none-any.whl (51kB)
100% |████████████████████████████████| 61kB 3.6MB/s
Collecting pyasn1-modules (from service-identity->Scrapy>=1.1.0->news-please)
Downloading pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->Scrapy>=1.1.0->news-please)
Downloading attrs-16.3.0-py2.py3-none-any.whl
Requirement already satisfied: setuptools in /System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python (from tldextract==1.5.1->newspaper->news-please)
Installing collected packages: six, w3lib, cssselect, parsel, pyasn1, pyasn1-modules, attrs, service-identity, Scrapy, PyMySQL, hjson, urllib3, elasticsearch, beautifulsoup4, chardet, readability-lxml, langdetect, python-dateutil, plac, Pillow, PyYAML, nltk, requests, jieba, feedparser, tldextract, feedfinder2, newspaper, news-please
Found existing installation: six 1.4.1
DEPRECATION: Uninstalling a distutils installed project (six) has been deprecated and will be removed in a future version. This is due to the fact that uninstalling a distutils project will only partially uninstall the project.
Uninstalling six-1.4.1:
Exception:
Traceback (most recent call last):
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/basecommand.py", line 215, in main
status = self.run(options, args)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/commands/install.py", line 342, in run
prefix=options.prefix_path,
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_set.py", line 778, in install
requirement.uninstall(auto_confirm=True)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_install.py", line 754, in uninstall
paths_to_remove.remove(auto_confirm)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/req/req_uninstall.py", line 115, in remove
renames(path, new_path)
File "/Library/Python/2.7/site-packages/pip-9.0.1-py2.7.egg/pip/utils/__init__.py", line 267, in renames
shutil.move(old, new)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 302, in move
copy2(src, real_dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 131, in copy2
copystat(src, dst)
File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/shutil.py", line 103, in copystat
os.chflags(dst, st.st_flags)
OSError: [Errno 1] Operation not permitted: '/tmp/pip-YSDQTZ-uninstall/System/Library/Frameworks/Python.framework/Versions/2.7/Extras/lib/python/six-1.4.1-py2.7.egg-info'

%%%%%%%%%Try to install by ignoring already installed%%%%%%%%%

Ians-MacBook-Air:~ ianmackerracher$ pip install --ignore-installed news-please
Collecting news-please
Using cached news-please-1.0.25.tar.gz
Collecting Scrapy>=1.1.0 (from news-please)
Using cached Scrapy-1.3.2-py2.py3-none-any.whl
Collecting PyMySQL>=0.7.9 (from news-please)
Using cached PyMySQL-0.7.10-py2.py3-none-any.whl
Collecting hjson>=1.5.8 (from news-please)
Using cached hjson-2.0.2.tar.gz
Collecting elasticsearch>=2.4 (from news-please)
Using cached elasticsearch-5.2.0-py2.py3-none-any.whl
Collecting beautifulsoup4>=4.5.1 (from news-please)
Using cached beautifulsoup4-4.5.3-py2-none-any.whl
Collecting readability-lxml>=0.6.2 (from news-please)
Using cached readability-lxml-0.6.2.tar.gz
Collecting langdetect>=1.0.7 (from news-please)
Using cached langdetect-1.0.7.zip
Collecting python-dateutil>=2.4.0 (from news-please)
Using cached python_dateutil-2.6.0-py2.py3-none-any.whl
Collecting plac>=0.9.6 (from news-please)
Using cached plac-0.9.6-py2.py3-none-any.whl
Collecting newspaper (from news-please)
Using cached newspaper-0.0.9.8.tar.gz
Collecting lxml (from Scrapy>=1.1.0->news-please)
Using cached lxml-3.7.3.tar.gz
Collecting PyDispatcher>=2.0.5 (from Scrapy>=1.1.0->news-please)
Using cached PyDispatcher-2.0.5.tar.gz
Collecting Twisted>=13.1.0 (from Scrapy>=1.1.0->news-please)
Using cached Twisted-17.1.0.tar.bz2
Collecting pyOpenSSL (from Scrapy>=1.1.0->news-please)
Using cached pyOpenSSL-16.2.0-py2.py3-none-any.whl
Collecting queuelib (from Scrapy>=1.1.0->news-please)
Using cached queuelib-1.4.2-py2.py3-none-any.whl
Collecting w3lib>=1.15.0 (from Scrapy>=1.1.0->news-please)
Using cached w3lib-1.17.0-py2.py3-none-any.whl
Collecting cssselect>=0.9 (from Scrapy>=1.1.0->news-please)
Using cached cssselect-1.0.1-py2.py3-none-any.whl
Collecting parsel>=1.1 (from Scrapy>=1.1.0->news-please)
Using cached parsel-1.1.0-py2.py3-none-any.whl
Collecting service-identity (from Scrapy>=1.1.0->news-please)
Using cached service_identity-16.0.0-py2.py3-none-any.whl
Collecting six>=1.5.2 (from Scrapy>=1.1.0->news-please)
Using cached six-1.10.0-py2.py3-none-any.whl
Collecting urllib3<2.0,>=1.8 (from elasticsearch>=2.4->news-please)
Using cached urllib3-1.20-py2.py3-none-any.whl
Collecting chardet (from readability-lxml>=0.6.2->news-please)
Using cached chardet-2.3.0-py2.py3-none-any.whl
Collecting Pillow==2.5.1 (from newspaper->news-please)
Using cached Pillow-2.5.1-cp27-none-macosx_10_6_intel.macosx_10_9_intel.macosx_10_9_x86_64.whl
Collecting PyYAML==3.11 (from newspaper->news-please)
Using cached PyYAML-3.11.zip
Collecting nltk==2.0.5 (from newspaper->news-please)
Using cached nltk-2.0.5.tar.gz
Collecting requests==2.3.0 (from newspaper->news-please)
Using cached requests-2.3.0-py2.py3-none-any.whl
Collecting jieba==0.35 (from newspaper->news-please)
Using cached jieba-0.35.zip
Collecting feedparser==5.1.3 (from newspaper->news-please)
Using cached feedparser-5.1.3.zip
Collecting tldextract==1.5.1 (from newspaper->news-please)
Using cached tldextract-1.5.1.tar.gz
Collecting feedfinder2==0.0.1 (from newspaper->news-please)
Using cached feedfinder2-0.0.1.tar.gz
Collecting zope.interface>=3.6.0 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached zope.interface-4.3.3-cp27-cp27m-macosx_10_11_x86_64.whl
Collecting constantly>=15.1 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached constantly-15.1.0-py2.py3-none-any.whl
Collecting incremental>=16.10.1 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached incremental-16.10.1-py2.py3-none-any.whl
Collecting Automat>=0.3.0 (from Twisted>=13.1.0->Scrapy>=1.1.0->news-please)
Using cached Automat-0.5.0-py2.py3-none-any.whl
Collecting cryptography>=1.3.4 (from pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached cryptography-1.7.2-cp27-cp27m-macosx_10_6_intel.whl
Collecting pyasn1 (from service-identity->Scrapy>=1.1.0->news-please)
Using cached pyasn1-0.2.2-py2.py3-none-any.whl
Collecting pyasn1-modules (from service-identity->Scrapy>=1.1.0->news-please)
Using cached pyasn1_modules-0.0.8-py2.py3-none-any.whl
Collecting attrs (from service-identity->Scrapy>=1.1.0->news-please)
Using cached attrs-16.3.0-py2.py3-none-any.whl
Collecting setuptools (from tldextract==1.5.1->newspaper->news-please)
Using cached setuptools-34.2.0-py2.py3-none-any.whl
Collecting idna>=2.0 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached idna-2.2-py2.py3-none-any.whl
Collecting ipaddress (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached ipaddress-1.0.18-py2-none-any.whl
Collecting enum34 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached enum34-1.1.6-py2-none-any.whl
Collecting cffi>=1.4.1 (from cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached cffi-1.9.1-cp27-cp27m-macosx_10_10_intel.whl
Collecting packaging>=16.8 (from setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached packaging-16.8-py2.py3-none-any.whl
Collecting appdirs>=1.4.0 (from setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached appdirs-1.4.0-py2.py3-none-any.whl
Collecting pycparser (from cffi>=1.4.1->cryptography>=1.3.4->pyOpenSSL->Scrapy>=1.1.0->news-please)
Using cached pycparser-2.17.tar.gz
Collecting pyparsing (from packaging>=16.8->setuptools->tldextract==1.5.1->newspaper->news-please)
Using cached pyparsing-2.1.10-py2.py3-none-any.whl
Installing collected packages: lxml, PyDispatcher, six, pyparsing, packaging, appdirs, setuptools, zope.interface, constantly, incremental, attrs, Automat, Twisted, idna, ipaddress, enum34, pyasn1, pycparser, cffi, cryptography, pyOpenSSL, queuelib, w3lib, cssselect, parsel, pyasn1-modules, service-identity, Scrapy, PyMySQL, hjson, urllib3, elasticsearch, beautifulsoup4, chardet, readability-lxml, langdetect, python-dateutil, plac, Pillow, PyYAML, nltk, requests, jieba, feedparser, tldextract, feedfinder2, newspaper, news-please
Running setup.py install for lxml ... error
Complete output from command /usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-nxGrQI-record/install-record.txt --single-version-externally-managed --compile:
Building lxml version 3.7.3.
Building without Cython.
Using build configuration of libxslt 1.1.29
Building against libxml2/libxslt in the following directory: /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib
running install
running build
running build_py
creating build
creating build/lib.macosx-10.12-intel-2.7
creating build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/__init__.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/_elementpath.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/builder.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/cssselect.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/doctestcompare.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/ElementInclude.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/pyclasslookup.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/sax.py -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/usedoctest.py -> build/lib.macosx-10.12-intel-2.7/lxml
creating build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/__init__.py -> build/lib.macosx-10.12-intel-2.7/lxml/includes
creating build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/__init__.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_diffcommand.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_html5builder.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/_setmixin.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/builder.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/clean.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/defs.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/diff.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/ElementSoup.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/formfill.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/html5parser.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/soupparser.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
copying src/lxml/html/usedoctest.py -> build/lib.macosx-10.12-intel-2.7/lxml/html
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron
copying src/lxml/isoschematron/__init__.py -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron
copying src/lxml/lxml.etree.h -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/lxml.etree_api.h -> build/lib.macosx-10.12-intel-2.7/lxml
copying src/lxml/includes/c14n.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/config.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/dtdvalid.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/etreepublic.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/htmlparser.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/relaxng.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/schematron.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/tree.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/uri.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xinclude.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlerror.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlparser.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xmlschema.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xpath.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/xslt.pxd -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/etree_defs.h -> build/lib.macosx-10.12-intel-2.7/lxml/includes
copying src/lxml/includes/lxml-version.h -> build/lib.macosx-10.12-intel-2.7/lxml/includes
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/rng
copying src/lxml/isoschematron/resources/rng/iso-schematron.rng -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/rng
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/RNG2Schtrn.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
copying src/lxml/isoschematron/resources/xsl/XSD2Schtrn.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl
creating build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_abstract_expand.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_dsdl_include.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_message.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_schematron_skeleton_for_xslt1.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/iso_svrl_for_xslt1.xsl -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
copying src/lxml/isoschematron/resources/xsl/iso-schematron-xslt1/readme.txt -> build/lib.macosx-10.12-intel-2.7/lxml/isoschematron/resources/xsl/iso-schematron-xslt1
running build_ext
building 'lxml.etree' extension
creating build/temp.macosx-10.12-intel-2.7
creating build/temp.macosx-10.12-intel-2.7/src
creating build/temp.macosx-10.12-intel-2.7/src/lxml
cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libxml2 -Isrc/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.etree.c -o build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.etree.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -Wl,-F. build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.etree.o -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.12-intel-2.7/lxml/etree.so
building 'lxml.objectify' extension
cc -fno-strict-aliasing -fno-common -dynamic -arch i386 -arch x86_64 -g -Os -pipe -fno-common -fno-strict-aliasing -fwrapv -DENABLE_DTRACE -DMACOSX -DNDEBUG -Wall -Wstrict-prototypes -Wshorten-64-to-32 -DNDEBUG -g -fwrapv -Os -Wall -Wstrict-prototypes -DENABLE_DTRACE -arch i386 -arch x86_64 -pipe -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/include/libxml2 -Isrc/lxml/includes -I/System/Library/Frameworks/Python.framework/Versions/2.7/include/python2.7 -c src/lxml/lxml.objectify.c -o build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.objectify.o -w -flat_namespace
cc -bundle -undefined dynamic_lookup -arch i386 -arch x86_64 -Wl,-F. build/temp.macosx-10.12-intel-2.7/src/lxml/lxml.objectify.o -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.12.sdk/usr/lib -lxslt -lexslt -lxml2 -lz -lm -o build/lib.macosx-10.12-intel-2.7/lxml/objectify.so
running install_lib
copying build/lib.macosx-10.12-intel-2.7/lxml/etree.so -> /Library/Python/2.7/site-packages/lxml
error: could not delete '/Library/Python/2.7/site-packages/lxml/etree.so': Permission denied


Command "/usr/bin/python -u -c "import setuptools, tokenize;__file__='/private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/setup.py';f=getattr(tokenize, 'open', open)(__file__);code=f.read().replace('\r\n', '\n');f.close();exec(compile(code, __file__, 'exec'))" install --record /var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-nxGrQI-record/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /private/var/folders/gy/5xt04_452z791v1qjs1yzxkh0000gn/T/pip-build-z_Uoxu/lxml/

Use this as a Python library

How can I use news-please in my code as a Python library? I would like to use it to continuously crawl a list of sites that is stored in a file.
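One possible shape for this, as a rough sketch only: it assumes the `NewsPlease.from_url` call shown in the project's README, and everything else (`read_urls`, `crawl_file`, `fetch_with_newsplease`, the `rounds` and `pause_seconds` parameters, one URL per line in the file) is a hypothetical name or convention chosen for illustration.

```python
import time


def fetch_with_newsplease(url):
    # Imported lazily so the loop below can be tried without news-please installed.
    # NewsPlease.from_url is the library call from the README (needs network access).
    from newsplease import NewsPlease
    return NewsPlease.from_url(url)


def read_urls(path):
    # Assumed file layout: one URL per line; blank lines and '#' comments are skipped.
    with open(path, encoding="utf-8") as handle:
        lines = (line.strip() for line in handle)
        return [line for line in lines if line and not line.startswith("#")]


def crawl_file(path, fetch=fetch_with_newsplease, rounds=1, pause_seconds=0.0):
    # Re-read the file on every round so that edits to the list are picked up.
    results = []
    for round_index in range(rounds):
        for url in read_urls(path):
            try:
                results.append((url, fetch(url)))
            except Exception as error:  # keep going if a single URL fails
                results.append((url, error))
        if round_index < rounds - 1:
            time.sleep(pause_seconds)
    return results
```

Passing a different `fetch` callable makes it possible to try the loop offline; a truly continuous crawl would raise `rounds` or wrap the call in an outer loop.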

improve ComparerTitle

To compare the extracted titles, the comparer builds a Cartesian product of them and compares the two titles in each tuple. The number of comparisons grows quadratically with the number of extractors, so this could become a performance problem if more extractors are added.
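To make the growth concrete, here is a small sketch that only counts the pairs involved. It is not the actual ComparerTitle code; `count_title_pairs` is a hypothetical helper, and the unordered-pairs variant is just one possible way to cut the work, assuming the comparison is symmetric.

```python
from itertools import combinations, product


def count_title_pairs(titles):
    # Full Cartesian product: every title paired with every title,
    # including itself and both orderings, i.e. n * n tuples.
    cartesian = sum(1 for _ in product(titles, repeat=2))
    # Unordered pairs of distinct titles: n * (n - 1) / 2 tuples.
    unordered = sum(1 for _ in combinations(titles, 2))
    return cartesian, unordered


if __name__ == "__main__":
    for n in (3, 4, 8):
        titles = ["title from extractor %d" % i for i in range(n)]
        print(n, count_title_pairs(titles))
```

With a symmetric comparison, iterating over unordered pairs does less than half the work of the full product, but the count is still quadratic in the number of extractors.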
