
scrawler's People

Contributors

  • dependabot[bot]
  • dglttr

scrawler's Issues

Export of web domain tree structure

scrawler already records the steps_from_start_page of each individual website. A structure recording which pages appear at which depth (counted from the start page) and on which sub-pages they were found could be extracted and saved, for example, as an XML file.
This could be useful for understanding the structure of web domains.
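A minimal sketch of such an export, using only the standard library: it assumes each crawled Website exposes url and steps_from_start_page, plus a hypothetical found_on attribute naming the page on which it was discovered.

```python
import xml.etree.ElementTree as ET

def export_domain_tree(websites, filename="domain_tree.xml"):
    root = ET.Element("domain")
    nodes = {}  # maps url -> XML element already created for that page

    # Process shallower pages first so parents exist before their children.
    for site in sorted(websites, key=lambda w: w.steps_from_start_page):
        parent = nodes.get(getattr(site, "found_on", None), root)
        node = ET.SubElement(
            parent, "page",
            url=site.url,
            depth=str(site.steps_from_start_page),
        )
        nodes[site.url] = node

    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)
```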

Add dictionary to Website object to pass along information

Currently, the steps_from_start_page attribute serves as a way to save this specific kind of meta information on websites. However, more information could potentially be collected as meta information and passed along with the Website object. An easy way would be to have a dedicated dictionary as an attribute of the Website class, somewhat similar to scrapy's Request.meta attribute.
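A minimal sketch of that attribute; the constructor signature is illustrative, not scrawler's actual one:

```python
class Website:
    def __init__(self, url, meta=None):
        self.url = url
        self.steps_from_start_page = 0
        # Free-form dictionary for arbitrary meta information that travels
        # with the Website object, in the spirit of scrapy's Request.meta.
        self.meta = meta if meta is not None else {}

# Callers could then stash and read whatever they need:
site = Website("https://example.com", meta={"category": "landing_page"})
site.meta["language"] = "en"
```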

More options for data exports

Additional options for data storage.

Ideas:

  • SQL databases
  • Writing data directly during the scraping process, instead of only at the end (a sketch combining both ideas follows below)
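A minimal sketch using the standard library's sqlite3; the class name and schema are purely illustrative:

```python
import sqlite3

class SQLiteExporter:
    """Persists each crawled page as soon as it is processed."""

    def __init__(self, path="crawl_results.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT, depth INTEGER)"
        )

    def write(self, url, depth):
        # Called once per crawled page, so results survive a crash
        # instead of existing only in memory until the crawl ends.
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (url, depth))
        self.conn.commit()

    def close(self):
        self.conn.close()
```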

Improve `export_attrs`

  • Make the index specifiable as a parameter
  • Ensure that parameters are not passed twice to to_csv() from within the kwargs (e.g. sep, index, header); see the sketch below
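A minimal sketch of the kwargs guard, assuming export_attrs forwards user-supplied kwargs to pandas.DataFrame.to_csv(); the signature here is hypothetical, not scrawler's actual one:

```python
import pandas as pd

def export_attrs(df: pd.DataFrame, filename: str, index: bool = False, **kwargs):
    # Parameters this function already passes explicitly must be stripped
    # from kwargs, or to_csv() raises "got multiple values for argument".
    reserved = {"path_or_buf", "index"}
    safe_kwargs = {k: v for k, v in kwargs.items() if k not in reserved}
    df.to_csv(filename, index=index, **safe_kwargs)
```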

Website.http_response objects are different depending on backend used

Problem

The objects stored in the http_response attribute are requests.Response or aiohttp.ClientResponse objects, depending on whether the multithreading or asyncio backend is used.
These two objects have slightly different attributes and available methods, which makes it harder to write data extractors that work with both.

Possible solution(s)

  • Write a wrapper that supports a defined set of functionality for both (doing method/attribute translation behind the scenes); see the sketch below
  • Settle on one backend (or a different one, in the future)
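A minimal sketch of the wrapper idea. The attribute differences it papers over are real (requests.Response exposes status_code and a plain text attribute, while aiohttp.ClientResponse exposes status and a coroutine text()), but the wrapper class itself is hypothetical:

```python
class UnifiedResponse:
    """Backend-agnostic view of an HTTP response."""

    def __init__(self, raw, body):
        self._raw = raw
        # aiohttp's .text() is a coroutine, so the backend must await the
        # body before wrapping; with requests, .text is a plain attribute.
        self.text = body

    @property
    def status_code(self):
        # requests exposes .status_code, aiohttp exposes .status
        if hasattr(self._raw, "status_code"):
            return self._raw.status_code
        return self._raw.status
```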

Investigate using scrapy as backend

Scrapy is a well-maintained, robust Python library for web scraping and crawling. Relying on scrapy as the backend could solve the concurrency issues and simplify the project, freeing it to focus on higher-level functionality such as data extractors.
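For illustration, a self-contained crawl delegated to scrapy could look like the following; the spider is a placeholder, not a proposed scrawler API:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class BackendSpider(scrapy.Spider):
    name = "scrawler_backend"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # scrawler's data extractors would be invoked here.
        yield {"url": response.url, "status": response.status}
        # Follow links; scrapy handles concurrency, deduplication
        # and politeness settings itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(BackendSpider)
process.start()  # blocks until the crawl finishes
```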
