diastro / zeek
Python distributed web scraper and dynamic crawler
License: MIT License
Read back the error code, and also catch the HTTPError exception.
Example:
HTTPError: HTTP Error 404: Not Found
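A minimal sketch of what catching the exception could look like, assuming Python 3's urllib (the project itself may use a different HTTP library). The `fetch` function and its `opener` parameter are hypothetical names used only for illustration:

```python
import urllib.error
import urllib.request


def fetch(url, opener=urllib.request.urlopen, timeout=10):
    """Return (status_code, body, error_text) for a single URL request."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.getcode(), response.read(), None
    except urllib.error.HTTPError as exc:
        # The server answered, but with an error status such as 404.
        return exc.code, b"", str(exc)
    except urllib.error.URLError as exc:
        # No HTTP answer at all (DNS failure, refused connection, ...).
        return None, b"", str(exc.reason)
```

HTTPError is caught before URLError because it is a subclass of it. For a missing page, the error text returned here would be "HTTP Error 404: Not Found", which the node can forward to the server together with the status code.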
The server first sends a configuration packet to all of its clients so that they share the same crawling settings.
The application will be able to parse a configuration file and use the needed values from it.
The server will be able to specify one of three crawling types: a specific list of URLs (static), indefinite crawling (dynamic), or domain-specific crawling.
The server will inform each node of the crawling type when sending the configuration info.
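As a rough illustration of the idea, the sketch below parses an INI-style configuration and builds a packet carrying the crawling type. The section name, option names, JSON encoding, and all function names are assumptions made for the example, not the project's actual format:

```python
import configparser
import json
from enum import Enum


class CrawlingType(Enum):
    STATIC = "static"    # crawl a fixed list of URLs
    DYNAMIC = "dynamic"  # keep following discovered links indefinitely
    DOMAIN = "domain"    # stay within a specific domain


# Hypothetical configuration file content.
SAMPLE_CONFIG = """
[crawling]
type = static
urls = http://example.com/a, http://example.com/b
"""


def load_config(text):
    """Parse config text into a plain dict that can be sent to every node."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    section = parser["crawling"]
    # Raises ValueError if the file names an unknown crawling type.
    crawl_type = CrawlingType(section.get("type", "dynamic"))
    urls = [u.strip() for u in section.get("urls", "").split(",") if u.strip()]
    return {"crawling_type": crawl_type.value, "urls": urls}


def build_config_packet(config):
    """Serialize the shared settings; JSON is used here only for illustration."""
    message = {"packet_type": "config", "payload": config}
    return json.dumps(message).encode("utf-8")
```

Sending the same serialized dictionary to every client is one simple way to make sure all nodes crawl with identical settings.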
Hello
Please, I am lost on how to view the output of the scraped URL.
Software requirements specification (SRS)
Requirements traceability
Make the size of the URL queue an input in the configuration file
Handle :
Communication protocol
MongoDB
Different actions will need to be taken depending on the type of packet received
Socket communication wasn't using delimiters, which caused the ServerSideClient to block on a read even though it had received more than a full packet. This happens only in rare cases, since I test Zeek by running multiple Python processes on the same server.
Fix:
Added a delimiter. After receiving data, we check whether there is a delimiter in the buffer. If so, we take the complete packet that's in there.
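The fix described above can be sketched as a small receive buffer. The newline delimiter and the `PacketBuffer` class are assumptions for illustration; the actual delimiter used by Zeek is not stated here:

```python
DELIMITER = b"\n"  # assumed delimiter; it must never appear inside a packet


class PacketBuffer:
    """Accumulates raw socket data and hands back only complete packets."""

    def __init__(self, delimiter=DELIMITER):
        self._delimiter = delimiter
        self._buffer = b""

    def feed(self, data):
        """Add received bytes; return the complete packets now available."""
        self._buffer += data
        # Everything before the last delimiter is complete; the tail is kept
        # in the buffer until the rest of that packet arrives.
        *packets, self._buffer = self._buffer.split(self._delimiter)
        return packets
```

A caller would pass each result of `sock.recv(...)` to `feed` and process whatever list comes back. Because one read can return several packets or only part of one, the reader never needs to block waiting for data it has already received.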
rule.py and scrapping.py are not being reloaded from client.py but from the originally imported file.
Enhancement needed: reload them from client.py.
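One possible approach, sketched under the assumption that the server sends the module's source text to the node: build the module object directly from that text instead of calling `importlib.reload`, which re-executes the module from wherever it was originally loaded. The function name and the module name used below are hypothetical:

```python
import sys
import types


def load_module_from_source(name, source):
    """Build (or replace) a module from source text received over the wire."""
    module = types.ModuleType(name)
    code = compile(source, f"<{name} from server>", "exec")
    exec(code, module.__dict__)
    # Later `import name` statements and sys.modules lookups see the new code.
    sys.modules[name] = module
    return module
```

Code that already holds a reference to the old module object keeps using the old version, so callers would need to look the module up again after each update. Executing source received over a socket also means the node trusts the server completely.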
Each connection has its own thread (server side).
Creation of the object class that will be passed between the server node and the working nodes
Dynamic reload of imported modules from the server
Message targeted to a specific client
Output results to file (CSV)
Read all the URLs in a webpage
Create rules to prevent the scraper from visiting certain URLs:
Class structure containing all relevant data for a visited site:
The protocol needs to be able to send a list of URLs, and the client needs to be able to interpret this list.
User manual - Deployment guide
Review of all the documentation :
Collect stats
URL dispatching and sorting (server)
Add proper error handling throughout the project
The working node needs to be able to scrape URLs from a web page and return the list of scraped URLs to the server.
If a URL request isn't successful, the working node needs to inform the server.
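A minimal sketch of that node-side behaviour, using only the standard library. The reply dictionary layout and all names here are assumptions for illustration, not the project's actual packet structure:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.links.append(urljoin(self.base_url, value))


def scrape_links(base_url, html):
    """Return all absolute link URLs found in the given HTML text."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links


def build_reply(url, html=None, error=None):
    """Reply for the server: scraped links on success, the error otherwise."""
    if error is not None:
        return {"url": url, "ok": False, "error": error, "links": []}
    return {"url": url, "ok": True, "error": None, "links": scrape_links(url, html)}
```

Returning a reply in both cases lets the server mark the URL as handled whether or not the request succeeded.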
Deployment of rule.py and scrapping.py to working nodes
Project "retrospective" report
exe_type needs to be formatted to remove < >
Repro :
Fix :
Handle Ctrl-C and force-close all sockets before exiting
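One way this could be handled, as a sketch: Python raises KeyboardInterrupt in the main thread on Ctrl-C, so wrapping the main loop in try/finally guarantees the cleanup runs. The `run` and `close_all` helpers are hypothetical names:

```python
import socket


def close_all(sockets):
    """Close every socket, ignoring ones that are already closed or broken."""
    for sock in sockets:
        try:
            sock.shutdown(socket.SHUT_RDWR)
        except OSError:
            pass  # not connected, or already shut down
        sock.close()


def run(serve_forever, sockets):
    """Run the main loop and always release the sockets, even on Ctrl-C."""
    try:
        serve_forever()
    except KeyboardInterrupt:
        pass  # Ctrl-C: fall through to the cleanup below
    finally:
        close_all(sockets)
```

Worker threads blocked on a read would still need to be daemon threads or be woken up some other way. Closing their sockets from the main thread is only part of a clean shutdown.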
Fix queue size.
Each server-side client will have a fixed number of URLs dispatched to it (e.g. 20). Every time a client replies after visiting a site, a new URL will be sent to it.
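The fixed-size dispatching described above can be sketched as follows. The `Dispatcher` class, its method names, and the use of client ids as dictionary keys are assumptions for illustration:

```python
from collections import deque

WINDOW_SIZE = 20  # the example value given in the description above


class Dispatcher:
    """Keeps at most `window_size` unanswered URLs assigned to each client."""

    def __init__(self, urls, window_size=WINDOW_SIZE):
        self.pending = deque(urls)
        self.window_size = window_size
        self.in_flight = {}  # client id -> set of URLs sent but not yet answered

    def register(self, client_id):
        """Give a new client its initial batch of URLs."""
        self.in_flight[client_id] = set()
        return self._fill(client_id)

    def on_reply(self, client_id, visited_url):
        """A client finished one URL: top its window back up."""
        self.in_flight[client_id].discard(visited_url)
        return self._fill(client_id)

    def _fill(self, client_id):
        """Move URLs from the pending queue to the client until it is full."""
        sent = []
        window = self.in_flight[client_id]
        while self.pending and len(window) < self.window_size:
            url = self.pending.popleft()
            window.add(url)
            sent.append(url)
        return sent
```

Each method returns the URLs that should now be sent to that client, so the socket code stays separate from the bookkeeping. Capping the number of outstanding URLs per client also limits how much work is lost if a client disconnects.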
Architecture overview
Dispatching of URLs:
ServerSide
Client: