
scrawler's People

Contributors

  • dependabot[bot]
  • dglttr

scrawler's Issues

Export of web domain tree structure

scrawler already records the steps_from_start_page of each individual website. A structure recording which pages appear at which depth (counted from the start page) and on which sub-pages they were found could be extracted and saved, for example, as an XML file.
This could be useful for understanding the structure of web domains.
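A minimal sketch of such an export, using only the standard library: it assumes each crawled Website exposes url and steps_from_start_page, plus a hypothetical found_on attribute naming the page on which it was discovered.

```python
import xml.etree.ElementTree as ET

def export_domain_tree(websites, filename="domain_tree.xml"):
    root = ET.Element("domain")
    nodes = {}  # maps url -> XML element already created for that page

    # Process shallower pages first so parents exist before their children.
    for site in sorted(websites, key=lambda w: w.steps_from_start_page):
        parent = nodes.get(getattr(site, "found_on", None), root)
        node = ET.SubElement(
            parent, "page",
            url=site.url,
            depth=str(site.steps_from_start_page),
        )
        nodes[site.url] = node

    ET.ElementTree(root).write(filename, encoding="utf-8", xml_declaration=True)
```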

Add dictionary to Website object to pass along information

Currently, the steps_from_start_page attribute serves as a way to save this specific kind of meta information on websites. However, more information could potentially be collected as meta information and passed along with the Website object. An easy way would be to have a dedicated dictionary as an attribute of the Website class, somewhat similar to scrapy's Request.meta attribute.
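A minimal sketch of that attribute; the constructor signature is illustrative, not scrawler's actual one:

```python
class Website:
    def __init__(self, url, meta=None):
        self.url = url
        self.steps_from_start_page = 0
        # Free-form dictionary for arbitrary meta information that travels
        # with the Website object, in the spirit of scrapy's Request.meta.
        self.meta = meta if meta is not None else {}

# Callers could then stash and read whatever they need:
site = Website("https://example.com", meta={"category": "landing_page"})
site.meta["language"] = "en"
```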

More options for data exports

Additional options for data storage.

Ideas:

  • SQL databases
  • Writing data directly during the scraping process, instead of only at the end (a sketch combining both ideas follows below)
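A minimal sketch using the standard library's sqlite3; the class name and schema are purely illustrative:

```python
import sqlite3

class SQLiteExporter:
    """Persists each crawled page as soon as it is processed."""

    def __init__(self, path="crawl_results.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pages (url TEXT, depth INTEGER)"
        )

    def write(self, url, depth):
        # Called once per crawled page, so results survive a crash
        # instead of existing only in memory until the crawl ends.
        self.conn.execute("INSERT INTO pages VALUES (?, ?)", (url, depth))
        self.conn.commit()

    def close(self):
        self.conn.close()
```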

Improve `export_attrs`

  • Make the index specifiable as a parameter
  • Ensure that parameters are not passed twice to to_csv() from within the kwargs (e.g. sep, index, header); see the sketch below
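A minimal sketch of the kwargs guard, assuming export_attrs forwards user-supplied kwargs to pandas.DataFrame.to_csv(); the signature here is hypothetical, not scrawler's actual one:

```python
import pandas as pd

def export_attrs(df: pd.DataFrame, filename: str, index: bool = False, **kwargs):
    # Parameters this function already passes explicitly must be stripped
    # from kwargs, or to_csv() raises "got multiple values for argument".
    reserved = {"path_or_buf", "index"}
    safe_kwargs = {k: v for k, v in kwargs.items() if k not in reserved}
    df.to_csv(filename, index=index, **safe_kwargs)
```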

Website.http_response objects are different depending on backend used

Problem

The objects stored in the http_response attribute are requests.Response or aiohttp.ClientResponse objects, depending on whether the multithreading or asyncio backend is used.
These two objects have slightly different attributes and available methods, which makes it harder to write data extractors that work with both.

Possible solution(s)

  • Write a wrapper that supports a defined set of functionality for both (doing method/attribute translation behind the scenes); see the sketch below
  • Settle on one backend (or a different one, in the future)
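A minimal sketch of the wrapper idea. The attribute differences it papers over are real (requests.Response exposes status_code and a plain text attribute, while aiohttp.ClientResponse exposes status and a coroutine text()), but the wrapper class itself is hypothetical:

```python
class UnifiedResponse:
    """Backend-agnostic view of an HTTP response."""

    def __init__(self, raw, body):
        self._raw = raw
        # aiohttp's .text() is a coroutine, so the backend must await the
        # body before wrapping; with requests, .text is a plain attribute.
        self.text = body

    @property
    def status_code(self):
        # requests exposes .status_code, aiohttp exposes .status
        if hasattr(self._raw, "status_code"):
            return self._raw.status_code
        return self._raw.status
```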

Investigate using scrapy as backend

Scrapy is a well-maintained, robust Python library for web scraping and crawling. Relying on scrapy as the backend could solve the concurrency issues and simplify the project, freeing it to focus on higher-level functionality such as data extractors.
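For illustration, a self-contained crawl delegated to scrapy could look like the following; the spider is a placeholder, not a proposed scrawler API:

```python
import scrapy
from scrapy.crawler import CrawlerProcess

class BackendSpider(scrapy.Spider):
    name = "scrawler_backend"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # scrawler's data extractors would be invoked here.
        yield {"url": response.url, "status": response.status}
        # Follow links; scrapy handles concurrency, deduplication
        # and politeness settings itself.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)

process = CrawlerProcess(settings={"LOG_ENABLED": False})
process.crawl(BackendSpider)
process.start()  # blocks until the crawl finishes
```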
