
webcrawler's Introduction

Dependencies: The crawler requires Python 3.6 or later. You need to install all the dependencies: bs4/beautifulsoup4 is used for HTML parsing, requests_mock is used to mock HTTP requests in the unit tests, and coverage is used to run the tests and measure test coverage.

To install all the dependencies, you can run: pip install -r requirements.txt

To crawl a website, you can run: python main.py url (where url is the entry URL)

To run the tests and coverage:
python -m unittest discover
coverage run -m unittest discover
coverage report -m

The design of this app: This web crawler is a mini app. Given an entry URL, the crawler prints all the URLs on the same domain, as well as all the links found on each page.

This crawler has two main storage components.
- One is called "urls_queue", where we put the entry URL and all the validated URLs. Every time we take a URL from the queue, we fetch the HTML content, parse it, extract the URLs from the content, validate them, and put the validated URLs into the urls_queue.
- The other is "visited_urls". Every time we extract a URL from urls_queue, we add it to the visited_urls set. We do this because there might be a cycle: for example, we visit "a", get "b", then visit "b" and get "a" again.
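The queue-plus-visited-set loop above is a standard breadth-first traversal. A minimal sketch (the `fetch_links` parameter is a hypothetical stand-in for the real fetch-parse-validate step, injected here so the loop can be shown without a network):

```python
from collections import deque

def crawl(entry_url, fetch_links):
    """Breadth-first crawl sketch.

    fetch_links(url) is assumed to return the validated links found on
    that page; it is not the project's actual API, just a stand-in.
    """
    urls_queue = deque([entry_url])  # entry URL plus later validated URLs
    visited_urls = set()             # guards against cycles like a -> b -> a
    while urls_queue:
        url = urls_queue.popleft()
        if url in visited_urls:
            continue
        visited_urls.add(url)
        print(url)
        for link in fetch_links(url):
            if link not in visited_urls:
                urls_queue.append(link)
    return visited_urls
```

With the cycle from the example ("a" links to "b", "b" links back to "a"), the visited set stops the loop after both pages are printed once.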

Some key points:

Concurrency: The WebCrawler uses multithreading to achieve concurrency. Without it, a single thread has to wait a long time (around one second on my local machine) for each response. With multiple threads, the CPU can execute other threads while one is waiting on the network, which makes the app much more efficient.
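To illustrate why threads help with network-bound waiting, here is a small sketch using `concurrent.futures.ThreadPoolExecutor` (the `fetch` function below simulates a slow response with `time.sleep`; it is a stand-in, not the project's code):

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for requests.get: simulate ~0.2s of network latency.
    time.sleep(0.2)
    return f"<html for {url}>"

urls = [f"https://example.com/page{i}" for i in range(8)]

start = time.perf_counter()
with ThreadPoolExecutor(max_workers=8) as pool:
    # All eight "requests" wait concurrently instead of one after another.
    results = list(pool.map(fetch, urls))
elapsed = time.perf_counter() - start
# Serially this would take roughly 8 * 0.2s; with threads the total is
# close to a single request's latency.
```

Threads (rather than processes) fit here because the work is I/O-bound: the GIL is released while a thread blocks on the network.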

Testing: The unit tests cover 76% of the code; the only uncovered lines are 75-77 and 81-84 in main.py, which are the argument-parsing part and, I think, don't need testing.

Retry logic: For the requirements, I chose to retry a failing URL three times before giving up. I chose this strategy because saving the URL and revisiting it later didn't seem necessary in this case.
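The retry-three-times-then-drop policy could look like the following sketch (the `get` callable is injected so the policy is testable without a network; the function name and signature are assumptions, not the project's actual code):

```python
def fetch_with_retry(url, get, retries=3):
    """Try fetching `url` up to `retries` times, then give up.

    Giving up means returning None rather than saving the URL for a
    later revisit, matching the policy described above.
    """
    for attempt in range(retries):
        try:
            return get(url)
        except Exception:
            if attempt == retries - 1:
                return None  # all three attempts failed; drop the URL
```

A transient failure that succeeds on the third attempt is recovered; a URL that fails all three times is simply skipped.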

webcrawler's People

Contributors: luckybaobaobao, linabinary
