teamhg-memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
It looks like 66ef047 broke it because JS code (e.g. https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/static/js/utils/Rpc.js) was not updated as a part of this change.
If I pause a crawl, and then stop/start arachnado, the crawl continues ("unpauses") after starting arachnado - I would expect it to remain paused.
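A fix would require the crawl's paused state to be persisted and re-applied on startup. A minimal sketch of the restore step, where `jobs` maps job IDs to crawler-like objects exposing a `pause()` method (as Scrapy's ExecutionEngine does) and `saved_states` stands in for whatever Arachnado persists to MongoDB; both shapes are assumptions for illustration, not Arachnado's actual API:

```python
def restore_job_states(jobs, saved_states):
    """Re-apply persisted crawl states after a restart.

    jobs: maps job_id -> a crawler-like object with a pause() method.
    saved_states: maps job_id -> last persisted state string.
    (Both shapes are illustrative assumptions, not Arachnado's API.)
    """
    for job_id, state in saved_states.items():
        job = jobs.get(job_id)
        # Only re-pause jobs that were paused before shutdown and
        # still exist after the restart.
        if job is not None and state == "paused":
            job.pause()
```

On startup this would run after crawls are resumed, so a previously paused crawl stays paused instead of silently unpausing.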
Hi all,
It can be difficult for users to configure spiders they have written themselves. If we added an option to upload a custom spider and run it from the web UI, this would become a very useful tool.
I really like this project and hope the author keeps it updated.
I have arachnado running on Ubuntu 14.04. The scraper appears to be working correctly but it doesn't save the scraped files to mongodb. I have mongo running on the default port and have created a db named 'arachnado'. Is there anything other than the configuration file to tell arachnado to store the files?
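For reference, Arachnado reads an INI-style config file; something like the fragment below should point it at a local MongoDB. The section and option names here are assumptions based on typical setups, so check them against the config.conf bundled with the package:

```ini
[arachnado]
port = 8888
host = 0.0.0.0

[arachnado.storage]
; store crawled items and job metadata in a local MongoDB
enabled = true
items_uri = mongodb://localhost:27017/arachnado-items
jobs_uri = mongodb://localhost:27017/arachnado-jobs
```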
How do I run a custom spider?
I use Anaconda2 (Python 2.7) for Windows 64-bit and my machine is Windows 10 Pro.
When I type arachnado in CMD, I get an error like this:
2017-03-10 11:57:28 [tornado.application] ERROR: Exception in callback <bound method ProcessStatsMonitor._emit of <arachnado.process_stats.ProcessStatsMonitor object at 0x00000000041854A8>>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 1041, in _run
    return self.callback()
  File "c:\users\king\anaconda2\lib\site-packages\arachnado\process_stats.py", line 61, in _emit
    'num_fds': self.process.num_fds(),
AttributeError: 'Process' object has no attribute 'num_fds'
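The root cause is that psutil's `Process.num_fds()` exists only on Unix; on Windows, `num_handles()` is the closest analogue. A hedged sketch of a cross-platform guard — the helper name and the exact stats collected are illustrative, not Arachnado's actual code:

```python
def collect_fd_stats(process):
    """Collect file-descriptor/handle counts without assuming the platform.

    `process` is expected to look like a psutil.Process; num_fds() is
    Unix-only and num_handles() is Windows-only, so probe for whichever
    method is available instead of calling one unconditionally.
    """
    stats = {}
    for name in ("num_fds", "num_handles"):
        method = getattr(process, name, None)
        if callable(method):
            stats[name] = method()
    return stats
```

With a guard like this, `_emit` would simply report whichever counter the platform provides instead of raising AttributeError every second on Windows.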
This error repeats every second.
When I stop arachnado, more errors come out:
2017-03-10 11:57:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-03-10 11:57:33 [tornado.application] ERROR: Exception in callback <functools.partial object at 0x0000000004247188>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 612, in _run_callback
    ret = gen.convert_yielded(ret)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 210, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 191, in dispatch
    impl = _find_impl(cls, registry)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 142, in _find_impl
    mro = _compose_mro(cls, registry.keys())
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 98, in _compose_mro
    bases = set(cls.__mro__)
AttributeError: class DeferredList has no attribute '__mro__'
I don't know what happened in the code.
Could you help me pinpoint it?
Best wishes
The UI can become slow when many transfers are active.
I am looking at the code and see all the signals being re-mapped, and hence Scrapy's ExecutionEngine and Downloader being subclassed.
1- What is the need for these custom signals and for re-implementing the signal handling?
2- WebSockets seem to be used for client communication; why isn't the HTTP API sufficient?
3- Does this obey regular Scrapy middleware, i.e. the cookies, robots.txt, etc. middlewares?
4- Does this use Splash as a browser?
5- Do auto-login or FormRequest work with this?
It seems to download all pages no matter whether AAA.BBB/CCC/DDD or AAA.BBB/EEE is given.
It would help me if it could download only the sub-pages of certain URLs. :)
The requests are filtered out by the dupefilter. See also: scrapy/scrapy#1225
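To restrict a crawl to sub-pages of a seed URL, one option is to filter candidate links by host and path prefix before they are scheduled. A minimal stdlib sketch, not Arachnado's built-in behaviour — in plain Scrapy this would usually be a `LinkExtractor(allow=...)` pattern or a spider middleware:

```python
from urllib.parse import urlparse

def is_subpage(url, base):
    """True if `url` is on the same host as `base` and under its path prefix.

    Intended as a link filter: keep AAA.BBB/CCC/DDD when the seed is
    AAA.BBB/CCC/, drop AAA.BBB/EEE.
    """
    u, b = urlparse(url), urlparse(base)
    return u.netloc == b.netloc and u.path.startswith(b.path)
```

A spider would call this on every extracted link and only yield requests for URLs where it returns True.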
Crash on Python 3.6.
Install via pip
pip3 install arachnado
Traceback (most recent call last):
File "/home/.local/bin/arachnado", line 8, in <module>
sys.exit(run())
File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 201, in run
opts=opts,
File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 58, in main
from arachnado.site_checker import get_site_checker_crawler
File "/home/.local/lib/python3.6/site-packages/arachnado/site_checker.py", line 8, in <module>
from scrapy.xlib.tx import ResponseFailed
ModuleNotFoundError: No module named 'scrapy.xlib'
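scrapy.xlib.tx was a vendored-Twisted shim that newer Scrapy releases removed, so the import in site_checker.py fails. A generic fallback-import helper can paper over both layouts; this is a sketch, and the specific module paths in the usage note are assumptions to verify against the Scrapy and Twisted versions you have installed:

```python
import importlib

def import_first(*specs):
    """Import the first "module:attribute" spec that resolves.

    Useful as a compat shim when a name has moved between library
    versions: each spec is tried in order and the first hit is returned.
    """
    last_err = None
    for spec in specs:
        module_name, _, attr = spec.partition(":")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr) if attr else module
        except (ImportError, AttributeError) as err:
            last_err = err
    raise last_err
```

With a helper like this, site_checker.py could do `ResponseFailed = import_first("twisted.web._newclient:ResponseFailed", "scrapy.xlib.tx:ResponseFailed")` and work under both old and new Scrapy.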
Hello @TeamHG-Memex,
Could you please create documentation for developers on how to build from source and create extensions?
I have been working with scrapy for a while and wish to extend my support for arachnado with ideas and features which I could build.
Thanks
I used pip install arachnado and it reported success, but the error "ImportError: No module named 'ConfigParser'" occurs when I execute the arachnado command.
I searched the internet and it seems Python 3 renamed ConfigParser to configparser, but the package says it supports Python 3.5, so why did this happen?
Does anyone know how to fix it?
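The usual fix for code that must run on both interpreters is a guarded import, since the stdlib module was renamed from ConfigParser (Python 2) to configparser (Python 3):

```python
try:
    import configparser  # Python 3 name
except ImportError:
    import ConfigParser as configparser  # Python 2 fallback
```

After this, `configparser.ConfigParser()` works under either interpreter; the crash suggests the installed release still uses the bare Python 2 import somewhere.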
site_checker.py has
from bot_detector.detector import Detector
but I can not find this dependency in requirements.txt.
Does arachnado have the ability to take in a list of seed URLs, either from the interface or the API?
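I'm not aware of a documented bulk-seed endpoint, but a thin client can loop over a seed list and start one crawl per URL through the HTTP API. In this sketch the endpoint path and the `{"domain": ...}` payload key are assumptions to check against your Arachnado version; the request builder is split out so it can be tested without a running server:

```python
import json

def build_start_requests(seed_urls, api="http://127.0.0.1:8888/crawler/start"):
    """Build one (endpoint, JSON payload) pair per seed URL.

    The endpoint path and the {"domain": ...} payload key are assumptions,
    not a documented Arachnado API; adjust them to match your installation.
    """
    return [(api, json.dumps({"domain": url})) for url in seed_urls]
```

Each pair can then be POSTed with urllib.request or requests, which gives the effect of submitting a whole seed list even though the UI takes one URL at a time.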