Skroutz Specs Scraper

A simple specs scraper for skroutz.gr, written in Go, that crawls product pages and outputs the results in different formats, such as CSV.

Originally it was created to help me choose the best air conditioning unit to buy for my apartment. However, it is built to be modular, with support for custom Filters and Store Parsers, so that it can crawl and parse other products and shops as well.

In addition, it caches requests locally in the cache directory so that it does not burden the website with requests for products that have already been crawled. For the time being the cache does not expire. If you want to recrawl the website because you believe the products have been updated or changed, delete the cache directory and it will be recreated on the next run.

Usage

  1. Copy the skroutz.gr URL of the page that lists the products you want to crawl. For example: https://www.skroutz.gr/c/407/oikiaka-klimatistika-inverter.html

  2. If you already have Go installed on your system, you can simply run

    go run . https://www.skroutz.gr/c/407/oikiaka-klimatistika-inverter.html > data.csv
    

to crawl the products and export them to a CSV file named data.csv. In the future I might also compile OS-specific binaries so that you don't have to install Go.
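The redirect works because the program prints its results to standard output. As a rough sketch of how rows can be rendered as CSV with Go's standard `encoding/csv` package, consider the following. The function name `toCSV` and the column names in the usage note are hypothetical and are not taken from the scraper's exporter.

```go
package main

import (
	"encoding/csv"
	"strings"
)

// toCSV renders a header row followed by data rows as CSV text.
// encoding/csv quotes any field that contains a comma, quote, or newline.
func toCSV(header []string, rows [][]string) (string, error) {
	var sb strings.Builder
	w := csv.NewWriter(&sb)
	if err := w.Write(header); err != nil {
		return "", err
	}
	if err := w.WriteAll(rows); err != nil { // WriteAll also flushes the writer
		return "", err
	}
	return sb.String(), nil
}
```

For example, calling `toCSV([]string{"name", "btu"}, rows)` and printing the returned string to standard output would let the shell redirect it into a file, as in the command above.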

Attention: Skroutz Cookie

skroutz.gr attempts to block automated software such as bots and crawlers from reading data from its website. After the program has sent a number of requests, the site presents a page with a CAPTCHA to be solved by a human. Naturally the crawler cannot solve this on its own, but there is a workaround: open skroutz.gr in your browser, copy the Cookie that skroutz.gr has created (this requires some familiarity with the browser's Dev Tools), and paste it into the skroutz_cookie field in config.yml.
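For readers curious how a copied cookie can be replayed, the sketch below attaches a cookie string to an outgoing request using Go's standard `net/http` package. The helper `newRequest` is hypothetical, and how the scraper itself reads skroutz_cookie and applies it is not shown here.

```go
package main

import (
	"net/http"
)

// newRequest builds a GET request for url and, when cookie is non-empty,
// attaches it verbatim as the Cookie header, the way a browser would send it.
// (Hypothetical helper for illustration only.)
func newRequest(url, cookie string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return nil, err
	}
	if cookie != "" {
		req.Header.Set("Cookie", cookie)
	}
	return req, nil
}
```

The cookie value is passed through unchanged, so it should be pasted exactly as the browser shows it in the request's Cookie header.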

Configuration using config.yml

TODO

Development

Testing

To check that the application runs as expected, run the automated tests that have been created so far:

    go test ./...

The expected output will be something like:

    ?       skroutz-specs-scraper   [no test files]
    ?       skroutz-specs-scraper/crawler   [no test files]
    ?       skroutz-specs-scraper/exporter  [no test files]
    ok      skroutz-specs-scraper/filter    0.324s
    ok      skroutz-specs-scraper/parser    0.471s

Adding more shop parsers

TODO

Adding more filters

TODO
