thatlittlegit-archive / webcrawler
:spider: Crawls websites for URLs, and stores them in a textfile.
License: GNU General Public License v3.0
get_urls_in_url is monolithic: our entire web crawler is basically this function in a loop. We need to split it up, preferably across different files, so we can implement more. For example, we need to log the data the crawler collects, but right now we would have to shove that somewhere in the indented pigsty that is called 'get_urls_in_url'. We also need to rename it: the current name sounds like it finds https://google.com in https://bing.com/https://google.com/.
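One way the split could look, as a rough sketch only. The function name crawl and its three steps (fetch, extract, log) are hypothetical and not taken from the repository; the point is just that logging becomes one more argument instead of more code inside the loop.

```rust
use std::collections::{HashSet, VecDeque};

/// Hypothetical crawl loop with the three steps passed in separately.
/// Returns the URLs that were fetched, in the order they were visited.
fn crawl<F, E, L>(start: &str, limit: usize, mut fetch: F, extract: E, mut log: L) -> Vec<String>
where
    F: FnMut(&str) -> Option<String>, // download a page body, None on failure
    E: Fn(&str) -> Vec<String>,       // find links in a page body
    L: FnMut(&str, usize),            // record a URL and how many links it had
{
    let mut seen = HashSet::new();
    let mut queue = VecDeque::new();
    let mut visited = Vec::new();

    seen.insert(start.to_string());
    queue.push_back(start.to_string());

    while let Some(url) = queue.pop_front() {
        if visited.len() >= limit {
            break;
        }
        // Skip pages that could not be fetched.
        let body = match fetch(url.as_str()) {
            Some(body) => body,
            None => continue,
        };
        let links = extract(body.as_str());
        log(url.as_str(), links.len());
        // Queue every link that has not been seen before.
        for link in links {
            if seen.insert(link.clone()) {
                queue.push_back(link);
            }
        }
        visited.push(url);
    }
    visited
}
```

Each step could then live in its own file, and a test can pass in a fake fetch that never touches the network.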
get_root_domain ought to be refactored. Such a simple function, which I'm sure could be compressed, is around 38 lines! We need to write more idiomatic Rust instead of basically porting JavaScript. It might help speed, too.
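A compact version might look something like this. It is a sketch that assumes "root domain" simply means the host part of the URL, which may not match what the current function returns, and it uses only the standard library.

```rust
/// Sketch: returns the host of a URL string, without scheme, userinfo,
/// port, path, query, or fragment. Assumes "root domain" means the host.
/// Does not handle bracketed IPv6 literals such as [::1].
fn get_root_domain(url: &str) -> Option<&str> {
    // Drop the scheme ("https://") if there is one.
    let rest = url.split_once("://").map_or(url, |(_, rest)| rest);
    // The authority ends at the first '/', '?' or '#'.
    let authority = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    // Drop "user:pass@" if present.
    let host_port = authority.rsplit_once('@').map_or(authority, |(_, h)| h);
    // Drop ":port" if present.
    let host = host_port.split_once(':').map_or(host_port, |(h, _)| h);
    if host.is_empty() {
        None
    } else {
        Some(host)
    }
}
```

For the confusing example from the rename discussion, get_root_domain("https://bing.com/https://google.com/") gives Some("bing.com").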
Crawler should be able to find URLs in txt files. Right now, they're indexed, but not parsed. This should be fixed, but I probably won't add it in the near future.
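A first pass at txt parsing could be as simple as the sketch below. The function name urls_in_text is made up for illustration, and it only recognises absolute http:// and https:// URLs separated by whitespace.

```rust
/// Sketch: finds absolute http(s) URLs in plain text.
/// Splits on whitespace and strips punctuation that commonly wraps a URL.
fn urls_in_text(text: &str) -> Vec<&str> {
    text.split_whitespace()
        .map(|word| {
            word.trim_matches(|c: char| {
                matches!(c, '<' | '>' | '(' | ')' | '"' | '\'' | ',' | '.' | ';')
            })
        })
        .filter(|word| word.starts_with("http://") || word.starts_with("https://"))
        .collect()
}
```

Trimming a trailing '.' means a URL that genuinely ends in a dot loses it, which seemed like the lesser evil for prose.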
Crawler should be able to find URLs in CSS files (i.e. url(...)). Right now, they're indexed, but not parsed. This should be fixed, but I probably won't add it in the near future.
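Roughly, the url(...) scan could work like this. It is a simplified sketch with a hypothetical function name urls_in_css: it ignores CSS escapes and comments, treats url( as case-sensitive, and will be confused by a ')' inside a quoted URL.

```rust
/// Sketch: collects the targets of url(...) tokens in a CSS source string.
fn urls_in_css(css: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = css;
    while let Some(start) = rest.find("url(") {
        // Everything after the opening "url(".
        let after = &rest[start + 4..];
        let end = match after.find(')') {
            Some(end) => end,
            None => break, // unterminated url(, stop scanning
        };
        // Strip surrounding whitespace and optional quotes.
        let target = after[..end]
            .trim()
            .trim_matches(|c: char| c == '"' || c == '\'');
        if !target.is_empty() {
            found.push(target);
        }
        rest = &after[end + 1..];
    }
    found
}
```

The results are whatever the stylesheet wrote, so relative paths like img/bg.png would still need resolving against the stylesheet's own URL.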
html5ever is built by the Servo project, and thus is probably faster. Maybe we should try switching? I think we should do a speed comparison - I doubt the differences will be large.
I've attempted to write a test for _main_loop, using Iron. However, it seems to hang while creating a reqwest::Client. The code is currently commented out, but it needs to be fixed.
This is currently waiting on this Stack Overflow question.
Crawler should be able to find URLs in JavaScript files (specifically strings). Right now, they're indexed, but not parsed. This should be fixed, but I probably won't add it in the near future.
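For the string-literal case, a naive scanner might be enough to start with. The sketch below uses a hypothetical function name urls_in_js_strings and is not a JavaScript parser: it does not skip comments or regex literals, and it keeps escaped characters verbatim instead of decoding them.

```rust
/// Sketch: returns string literals in JavaScript source that start with
/// http:// or https://. Handles '...', "..." and `...` delimiters.
fn urls_in_js_strings(js: &str) -> Vec<String> {
    let mut found = Vec::new();
    let mut chars = js.chars();
    while let Some(c) = chars.next() {
        if c == '"' || c == '\'' || c == '`' {
            let quote = c;
            let mut literal = String::new();
            let mut closed = false;
            // Read up to the matching, unescaped closing quote.
            while let Some(d) = chars.next() {
                if d == '\\' {
                    // Keep the character after a backslash as-is.
                    if let Some(escaped) = chars.next() {
                        literal.push(escaped);
                    }
                } else if d == quote {
                    closed = true;
                    break;
                } else {
                    literal.push(d);
                }
            }
            if closed && (literal.starts_with("http://") || literal.starts_with("https://")) {
                found.push(literal);
            }
        }
    }
    found
}
```

An apostrophe inside a comment would throw this off, so a real implementation would probably want a proper tokenizer.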