
arachnado's People

Contributors

fornarat, kmike, lopuhin, lukemaxwell, mehaase, shirk3y, zergey

arachnado's Issues

Storage of scraped files

I have arachnado running on Ubuntu 14.04. The scraper appears to be working correctly, but it doesn't save the scraped files to MongoDB. I have mongod running on the default port and have created a database named 'arachnado'. Is there anything, other than the configuration file, needed to tell arachnado to store the files?
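A useful first check in a setup like this (a stdlib-only sketch; the 'arachnado' database name and default port are taken from the report above, not from arachnado's code) is whether mongod is actually reachable before digging into the crawler's configuration:

```python
import socket

def mongo_reachable(host="localhost", port=27017, timeout=2.0):
    """Return True if something accepts TCP connections on the MongoDB
    port -- the first thing to rule out when scraped items never appear."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, unreachable, or timed out
        return False
```

If this returns True, the next step is to compare the database and collection the items actually land in (e.g. `db.getCollectionNames()` in the mongo shell) against the storage URI in arachnado's config file.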

Can't run arachnado

I use Anaconda2 (Python 2.7, 64-bit) on Windows 10 Pro.

When I run arachnado from CMD, I get an error like this:

2017-03-10 11:57:28 [tornado.application] ERROR: Exception in callback <bound method ProcessStatsMonitor._emit of <arachnado.process_stats.ProcessStatsMonitor object at 0x00000000041854A8>>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 1041, in _run
    return self.callback()
  File "c:\users\king\anaconda2\lib\site-packages\arachnado\process_stats.py", line 61, in _emit
    'num_fds': self.process.num_fds(),
AttributeError: 'Process' object has no attribute 'num_fds'

This error repeats every second.
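The traceback points at psutil: `Process.num_fds()` exists only on POSIX systems, while Windows exposes `num_handles()` instead, which would explain the stats monitor crashing there. A sketch of a platform-safe accessor (illustrative names, not arachnado's actual code):

```python
def safe_num_fds(process):
    """Return an open-descriptor count without assuming the platform.

    psutil defines num_fds() only on POSIX; on Windows the closest
    equivalent is num_handles(). Probing with hasattr avoids the
    AttributeError seen in the traceback above.
    """
    if hasattr(process, "num_fds"):
        return process.num_fds()      # POSIX
    if hasattr(process, "num_handles"):
        return process.num_handles()  # Windows
    return None                       # platform offers neither counter
```

In arachnado's `_emit` this would be called as `safe_num_fds(self.process)` with a `psutil.Process` instance.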

When I stop arachnado, more errors come out:

2017-03-10 11:57:33 [scrapy.crawler] INFO: Received SIGINT, shutting down gracefully. Send again to force
2017-03-10 11:57:33 [tornado.application] ERROR: Exception in callback <functools.partial object at 0x0000000004247188>
Traceback (most recent call last):
  File "c:\users\king\anaconda2\lib\site-packages\tornado\ioloop.py", line 612, in _run_callback
    ret = gen.convert_yielded(ret)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 210, in wrapper
    return dispatch(args[0].__class__)(*args, **kw)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 191, in dispatch
    impl = _find_impl(cls, registry)
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 142, in _find_impl
    mro = _compose_mro(cls, registry.keys())
  File "c:\users\king\anaconda2\lib\site-packages\singledispatch.py", line 98, in _compose_mro
    bases = set(cls.__mro__)
AttributeError: class DeferredList has no attribute '__mro__'

I don't know what happened in the code.

Could you help me pinpoint it?

Best wishes

Signals

I am looking at the code and see all the signals being re-mapped, and hence Scrapy's ExecutionEngine and Downloader being subclassed.

1. What is the need for these custom signals and for re-implementing the signal handling?
2. Web sockets seem to be used for client communication; why isn't an HTTP API sufficient?
3. Does this obey the regular Scrapy middlewares, i.e. the cookies, robots.txt, etc. middleware?
4. Does this use Splash as a browser?
5. Do auto-login or FormRequest work with this?

Download sub pages of certain urls

It seems to download all pages, no matter whether AAA.BBB/CCC/DDD or AAA.BBB/EEE is given.

It would help me if arachnado could download only the sub-pages of the given URL. :)
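In stock Scrapy this kind of restriction is usually expressed with `LinkExtractor(allow=...)` patterns; the prefix rule the issue asks for can be sketched as a plain function (names are illustrative, not arachnado's API):

```python
from urllib.parse import urlparse

def under_seed(url, seed):
    """True only if url lives under the seed URL's host and path prefix,
    so a crawl seeded with AAA.BBB/CCC/DDD never wanders into AAA.BBB/EEE."""
    u, s = urlparse(url), urlparse(seed)
    # Normalize the seed path to a directory-style prefix.
    prefix = s.path if s.path.endswith("/") else s.path + "/"
    return u.netloc == s.netloc and (u.path + "/").startswith(prefix)
```

A spider could call this in its link-filtering step and drop any candidate URL for which it returns False.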

ModuleNotFoundError: No module named 'scrapy.xlib'

Python 3.6 complains after installing via pip:

pip3 install arachnado
Traceback (most recent call last):
  File "/home/.local/bin/arachnado", line 8, in <module>
    sys.exit(run())
  File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 201, in run
    opts=opts,
  File "/home/.local/lib/python3.6/site-packages/arachnado/__main__.py", line 58, in main
    from arachnado.site_checker import get_site_checker_crawler
  File "/home/.local/lib/python3.6/site-packages/arachnado/site_checker.py", line 8, in <module>
    from scrapy.xlib.tx import ResponseFailed
ModuleNotFoundError: No module named 'scrapy.xlib'
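`scrapy.xlib` was a bundled copy of parts of Twisted and was removed in Scrapy 2.0; `ResponseFailed` now has to be imported from Twisted itself. A compatibility shim along these lines (a sketch, not a confirmed upstream patch) keeps both old and new Scrapy installations working:

```python
# Prefer the modern Twisted location, fall back to the old bundled path.
try:
    from twisted.web._newclient import ResponseFailed   # modern Twisted
except ImportError:
    try:
        from scrapy.xlib.tx import ResponseFailed       # old Scrapy bundle
    except ImportError:
        ResponseFailed = None  # neither available in this environment
```

With the shim in `site_checker.py`, the rest of the module can keep referring to `ResponseFailed` unchanged.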

Build from source

Hello @TeamHG-Memex,
Could you please create documentation for developers on building from source and creating extensions?
I have been working with Scrapy for a while and would like to support arachnado with ideas and features that I could build.

Thanks

ImportError: No module named 'ConfigParser'

I used pip install arachnado and it reported success, but the error "ImportError: No module named 'ConfigParser'" occurs when I execute the arachnado command.
I searched the internet; it seems Python 3.5 uses configparser in place of ConfigParser, but the package says it supports Python 3.5, so why does this happen?
Does anyone know how to fix it?
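The `ConfigParser` module was indeed renamed to `configparser` in Python 3, so an unconditional Python 2 import breaks on 3.5+. The usual fix (what `six.moves` does under the hood) is a two-way import shim; this is a sketch of that pattern, not arachnado's actual code:

```python
# Import the config parser under one name on both interpreter lines.
try:
    import configparser                       # Python 3 name
except ImportError:
    import ConfigParser as configparser       # Python 2 name

config = configparser.ConfigParser()
# read_string is the Python 3 API; Python 2 code would use readfp().
config.read_string("[mongo]\nuri = mongodb://localhost:27017\n")
print(config.get("mongo", "uri"))
```

Any module in the package that imports `ConfigParser` directly would need the same treatment for the Python 3.5 claim to hold.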

bot_detector

site_checker.py has

from bot_detector.detector import Detector

but I cannot find this dependency in requirements.txt.

Ability to add seed urls

Does arachnado have the ability to take in a list of seed URLs, either from the interface or the API?
