teamhg-memex / arachnado
Web Crawling UI and HTTP API, based on Scrapy and Tornado
It looks like 66ef047 broke it because JS code (e.g. https://github.com/TeamHG-Memex/arachnado/blob/master/arachnado/static/js/utils/Rpc.js) was not updated as a part of this change.
If I pause a crawl, and then stop/start arachnado, the crawl continues ("unpauses") after starting arachnado - I would expect it to remain paused.
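A fix would require the crawl's paused state to be persisted and re-applied on startup. A minimal sketch of the restore step, where `jobs` maps job IDs to crawler-like objects exposing a `pause()` method (as Scrapy's ExecutionEngine does) and `saved_states` stands in for whatever Arachnado persists to MongoDB; both shapes are assumptions for illustration, not Arachnado's actual API:

```python
def restore_job_states(jobs, saved_states):
    """Re-apply persisted crawl states after a restart.

    jobs: maps job_id -> a crawler-like object with a pause() method.
    saved_states: maps job_id -> last persisted state string.
    (Both shapes are illustrative assumptions, not Arachnado's API.)
    """
    for job_id, state in saved_states.items():
        job = jobs.get(job_id)
        # Only re-pause jobs that were paused before shutdown and
        # still exist after the restart.
        if job is not None and state == "paused":
            job.pause()
```

On startup this would run after crawls are resumed, so a previously paused crawl stays paused instead of silently unpausing.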
Hi all,
It can be difficult for users to configure spiders they have written themselves. If we added an option to upload a custom spider and run it from the web UI, this would become a very useful tool.
I really like this project and hope the author keeps it updated.
I have arachnado running on Ubuntu 14.04. The scraper appears to be working correctly but it doesn't save the scraped files to mongodb. I have mongo running on the default port and have created a db named 'arachnado'. Is there anything other than the configuration file to tell arachnado to store the files?
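For reference, Arachnado reads an INI-style config file; something like the fragment below should point it at a local MongoDB. The section and option names here are assumptions based on typical setups, so check them against the config.conf bundled with the package:

```ini
[arachnado]
port = 8888
host = 0.0.0.0

[arachnado.storage]
; store crawled items and job metadata in a local MongoDB
enabled = true
items_uri = mongodb://localhost:27017/arachnado-items
jobs_uri = mongodb://localhost:27017/arachnado-jobs
```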
How do I run a custom spider?
I use Anaconda2 (Python 2.7) for Windows 64-bit and my machine is Windows 10 Pro.
When I type arachnado in CMD, I get an error like this:
2017-03-10 11:57:28 [tornado.application] ERROR: Exception in callback <bound method ProcessStatsMonitor._emit of <arachnado.process_stats.ProcessStatsMonitor object at 0x00000000041854A8>>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 1041, in _run
    return self.callback()
  File "c:\users\king\anaconda2\lib\site-packages\arachnado\process_stats.py", line 61, in _emit
    'num_fds': self.process.num_fds(),
AttributeError: 'Process' object has no attribute 'num_fds'
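The root cause is that psutil's `Process.num_fds()` exists only on Unix; on Windows, `num_handles()` is the closest analogue. A hedged sketch of a cross-platform guard — the helper name and the exact stats collected are illustrative, not Arachnado's actual code:

```python
def collect_fd_stats(process):
    """Collect file-descriptor/handle counts without assuming the platform.

    `process` is expected to look like a psutil.Process; num_fds() is
    Unix-only and num_handles() is Windows-only, so probe for whichever
    method is available instead of calling one unconditionally.
    """
    stats = {}
    for name in ("num_fds", "num_handles"):
        method = getattr(process, name, None)
        if callable(method):
            stats[name] = method()
    return stats
```

With a guard like this, `_emit` would simply report whichever counter the platform provides instead of raising AttributeError every second on Windows.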
This error repeats every second.
When I stop arachnado, more errors come out:
2017-03-10 11:57:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-03-10 11:57:33 [tornado.application] ERROR: Exception in callback <functools.partial object at 0x0000000004247188>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 612, in _run_callback
    ret = gen.convert_yielded(ret)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 210, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 191, in dispatch
    impl = _find_impl(cls, registry)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 142, in _find_impl
    mro = _compose_mro(cls, registry.keys())
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 98, in _compose_mro
    bases = set(cls.__mro__)
AttributeError: class DeferredList has no attribute '__mro__'
I don't know what happened in the code.
Could you help me pinpoint it?
Best wishes
The UI can become slow when many transfers are active.
I am looking at the code and see all the signals being re-mapped, and hence Scrapy's ExecutionEngine and Downloader being subclassed.
1- What is the need for these custom signals and for re-implementing the signal handling?
2- WebSockets seem to be used for client communication; why isn't the HTTP API sufficient?
3- Does this obey regular Scrapy middleware, i.e. the cookies, robots.txt, etc. middlewares?
4- Does this use Splash as a browser?
5- Do auto-login or FormRequest work with this?
It seems to download all pages no matter whether AAA.BBB/CCC/DDD or AAA.BBB/EEE is given.
It would help me if it could download only the sub-pages of certain URLs. :)
The requests are filtered out by the dupefilter. See also: scrapy/scrapy#1225
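To restrict a crawl to sub-pages of a seed URL, one option is to filter candidate links by host and path prefix before they are scheduled. A minimal stdlib sketch, not Arachnado's built-in behaviour — in plain Scrapy this would usually be a `LinkExtractor(allow=...)` pattern or a spider middleware:

```python
from urllib.parse import urlparse

def is_subpage(url, base):
    """True if `url` is on the same host as `base` and under its path prefix.

    Intended as a link filter: keep AAA.BBB/CCC/DDD when the seed is
    AAA.BBB/CCC/, drop AAA.BBB/EEE.
    """
    u, b = urlparse(url), urlparse(base)
    return u.netloc == b.netloc and u.path.startswith(b.path)
```

A spider would call this on every extracted link and only yield requests for URLs where it returns True.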
Crash on Python 3.6.
Install via pip
pip3 install arachnado
Traceback (most recent call last):
File "/home/.local/bin/arachnado", line 8, in <module>
sys.exit(run())
File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 201, in run
opts=opts,
File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 58, in main
from arachnado.site_checker import get_site_checker_crawler
File "/home/.local/lib/python3.6/site-packages/arachnado/site_checker.py", line 8, in <module>
from scrapy.xlib.tx import ResponseFailed
ModuleNotFoundError: No module named 'scrapy.xlib'
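scrapy.xlib.tx was a vendored-Twisted shim that newer Scrapy releases removed, so the import in site_checker.py fails. A generic fallback-import helper can paper over both layouts; this is a sketch, and the specific module paths in the usage note are assumptions to verify against the Scrapy and Twisted versions you have installed:

```python
import importlib

def import_first(*specs):
    """Import the first "module:attribute" spec that resolves.

    Useful as a compat shim when a name has moved between library
    versions: each spec is tried in order and the first hit is returned.
    """
    last_err = None
    for spec in specs:
        module_name, _, attr = spec.partition(":")
        try:
            module = importlib.import_module(module_name)
            return getattr(module, attr) if attr else module
        except (ImportError, AttributeError) as err:
            last_err = err
    raise last_err
```

With a helper like this, site_checker.py could do `ResponseFailed = import_first("twisted.web._newclient:ResponseFailed", "scrapy.xlib.tx:ResponseFailed")` and work under both old and new Scrapy.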
Hello @TeamHG-Memex,
Could you please create documentation for developers on how to build from source and create extensions?
I have been working with scrapy for a while and wish to extend my support for arachnado with ideas and features which I could build.
Thanks
I used pip install arachnado and it reported success, but the error "ImportError: No module named 'ConfigParser'" occurs when I execute the arachnado command.
I searched the internet and it seems Python 3 renamed ConfigParser to configparser, but the package says it supports Python 3.5, so why did this happen?
Does anyone know how to fix it?
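The usual fix for code that must run on both interpreters is a guarded import, since the stdlib module was renamed from ConfigParser (Python 2) to configparser (Python 3):

```python
try:
    import configparser  # Python 3 name
except ImportError:
    import ConfigParser as configparser  # Python 2 fallback
```

After this, `configparser.ConfigParser()` works under either interpreter; the crash suggests the installed release still uses the bare Python 2 import somewhere.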
site_checker.py has
from bot_detector.detector import Detector
but I can not find this dependency in requirements.txt.
Does arachnado have the ability to take in a list of seed URLs, either from the interface or the API?
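I'm not aware of a documented bulk-seed endpoint, but a thin client can loop over a seed list and start one crawl per URL through the HTTP API. In this sketch the endpoint path and the `{"domain": ...}` payload key are assumptions to check against your Arachnado version; the request builder is split out so it can be tested without a running server:

```python
import json

def build_start_requests(seed_urls, api="http://127.0.0.1:8888/crawler/start"):
    """Build one (endpoint, JSON payload) pair per seed URL.

    The endpoint path and the {"domain": ...} payload key are assumptions,
    not a documented Arachnado API; adjust them to match your installation.
    """
    return [(api, json.dumps({"domain": url})) for url in seed_urls]
```

Each pair can then be POSTed with urllib.request or requests, which gives the effect of submitting a whole seed list even though the UI takes one URL at a time.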