
washeuteessen-crawler_and_parser's Introduction

webcrawler

Use Scrapy to crawl the title, URL, image URL, ingredients and text of German recipe pages:

| #  | Domain                                       | Total number of recipes | Crawled |
|----|----------------------------------------------|-------------------------|---------|
| 1  | www.chefkoch.de                              | 330,000                 | X       |
| 2  | www.ichkoche.at                              | 150,000                 | X       |
| 3  | www.daskochrezept.de                         | 87,000                  |         |
| 4  | www.eatsmarter.de                            | 83,000                  | X       |
| 5  | www.lecker.de                                | 60,000                  | X       |
| 6  | www.essen-und-trinken.de                     | 30,000                  | X       |
| 7  | www.rewe.de                                  | 4,000                   |         |
| 8  | www.rapunzel.de                              | ???                     |         |
| 9  | www.springlane.de                            | ???                     |         |
| 10 | www.proveg.com                               | ???                     |         |
| 11 | www.eat-this.org                             | ???                     |         |
| 12 | www.veganheaven.de                           | ???                     |         |
| 13 | www.youtube.de                               | ???                     |         |
| 14 | www.brigitte.de/rezepte                      | ???                     |         |
| 15 | www.angebrannt.de                            | ???                     |         |
| 16 | www.frag-mutti.de                            | ???                     |         |
| 17 | www.livingathome.de/kochen-feiern/rezepte/archiv | ???                 |         |
| 18 | www.kochbar.de                               | ???                     |         |
| 19 | www.cocktails.de                             | ???                     |         |
| 20 | www.backenmachtgluecklich.de                 | ???                     |         |
| 21 | www.womenshealth.de                          | 2,190                   | X       |

Local run

  1. Adapt the host of the Mongo client

  2. Build the image from the local Dockerfile

    docker build -t image .
  3. Run the image

    docker run --env SPIDER_NAME=chefkoch image
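Step 1 above ("adapt the host of the Mongo client") can also be done without editing code by reading the host from the environment, the same way `SPIDER_NAME` is passed in. A minimal sketch, assuming a `MONGO_HOST` environment variable and a `mongo_uri` helper (both are illustrative names, not part of this repo):

```python
import os

def mongo_uri(default_host="mongo", port=27017):
    # Hypothetical helper: read the Mongo host from the environment so the
    # same image works locally (MONGO_HOST=localhost) and in the cluster,
    # where the service is reachable as "mongo".
    host = os.environ.get("MONGO_HOST", default_host)
    return f"mongodb://{host}:{port}"

# With pymongo (which the crawler uses) this would be passed as:
#   client = pymongo.MongoClient(mongo_uri())
print(mongo_uri())  # mongodb://mongo:27017 unless MONGO_HOST is set
```

The container could then be started with `docker run --env MONGO_HOST=localhost ...` for local runs.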

Deploy to Openshift

  1. Verify you're logged in and select the correct namespace

    $ oc projects
    $ oc project *namespace*
  2. Start the build with

    $ oc start-build washeuteessen-crawler --from-dir=. --follow
  3. Verify the result

    $ oc describe dc/washeuteessen-crawler
    
    ...
    Deployment #3 (latest):
    	Name:		washeuteessen-crawler-X
    	Created:	2 minutes ago
    	Status:		Active
    ...
    

Run cronjob on OpenShift

  1. Show all existing cronjobs

    $ oc get cronjobs
  2. Add a new cronjob (example: crawler-ichkoche)

    $ oc run crawler-ichkoche --image=docker-registry.default.svc:5000/washeuteessen-test/washeuteessen-crawler:0.2 \
        --schedule='39 19 * * THU' \
        --restart=Never
  3. Abort a running cronjob (pod)

    $ oc delete pod <pod name>
  4. Edit an existing cronjob

    $ oc edit cronjob <cronjob name>
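The `--schedule` value is a standard five-field cron expression (minute, hour, day-of-month, month, day-of-week), so `39 19 * * THU` means 19:39 every Thursday; the fields must be separated by spaces. A small shape check for such expressions (a hypothetical helper for illustration, not part of this repo, and not a full cron parser):

```python
import re

# Each field may be a number, a name like THU, "*", or combinations
# with "-", "/", "," (ranges, steps, lists). This validates the shape only.
_FIELD = re.compile(r"^(\*|[A-Za-z0-9]+)((-|/|,)[A-Za-z0-9*]+)*$")

def is_cron_expr(expr: str) -> bool:
    # minute hour day-of-month month day-of-week
    fields = expr.split()
    return len(fields) == 5 and all(_FIELD.match(f) for f in fields)

assert is_cron_expr("39 19 * * THU")   # 19:39 every Thursday
assert not is_cron_expr("3919**THU")   # fields must be space-separated
```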

Example cronjob settings: crawler-ichkoche

# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  creationTimestamp: 2019-07-11T19:38:21Z
  name: crawler-ichkoche
  namespace: washeuteessen-test
  resourceVersion: "4935178"
  selfLink: /apis/batch/v1beta1/namespaces/washeuteessen-test/cronjobs/crawler-ichkoche
  uid: 70ad381d-a413-11e9-970b-96000028fad7
spec:
  concurrencyPolicy: Allow
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      creationTimestamp: null
    spec:
      template:
        metadata:
          creationTimestamp: null
          labels:
            parent: crawler-ichkoche
        spec:
          containers:
          - env:
            - name: SPIDER_NAME
              value: ichkoche
            image: docker-registry.default.svc:5000/washeuteessen-test/washeuteessen-crawler:0.2
            imagePullPolicy: IfNotPresent
            name: crawler
            resources: {}
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
          dnsPolicy: ClusterFirst
          restartPolicy: Never
          schedulerName: default-scheduler
          securityContext: {}
          terminationGracePeriodSeconds: 30
  schedule: 39 19 * * THU
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  active:
  - apiVersion: batch/v1
    kind: Job
    name: crawler-ichkoche-1563478740
    namespace: washeuteessen-test
    resourceVersion: "4935177"
    uid: b2e0114b-a993-11e9-970b-96000028fad7
  lastScheduleTime: 2019-07-18T19:39:00Z

washeuteessen-crawler_and_parser's People

Contributors

arndt-s, lena-kuhn, valentinkuhn


washeuteessen-crawler_and_parser's Issues

Duplicate Documents in Database

Check if a document is already in the DB and don't insert it again.

12,000 duplicates found.

ToDo:

  • dump the DB locally (oc rsync)
  • delete duplicates
  • monitor the number of duplicates in Grafana/Prometheus
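One way to implement the check, sketched here under the assumption that every crawled recipe carries a unique `url` field (which the crawler does extract): deduplicate in the pipeline before inserting, and/or enforce uniqueness in MongoDB itself. The function name and the pymongo calls in the comments are illustrative, not the repo's actual code:

```python
def drop_duplicates(recipes):
    # Keep only the first recipe seen for each URL, preserving order.
    seen = set()
    unique = []
    for recipe in recipes:
        if recipe["url"] not in seen:
            seen.add(recipe["url"])
            unique.append(recipe)
    return unique

# On the MongoDB side, the same guarantee could come from a unique index
# plus upserts instead of plain inserts (pymongo, shown as comments only):
#   collection.create_index("url", unique=True)
#   collection.update_one({"url": r["url"]}, {"$set": r}, upsert=True)

recipes = [
    {"url": "a", "title": "x"},
    {"url": "a", "title": "y"},   # duplicate of "a"
    {"url": "b", "title": "z"},
]
print(len(drop_duplicates(recipes)))  # 2
```

The unique-index approach also prevents duplicates from racing concurrent cronjob runs, which in-process deduplication alone cannot.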

start crawling

Hey! Great project!

I am interested in the recipes from chefkoch and others. I tried building the Docker image and running the develop branch, yet the code runs and constantly outputs:

pymongo.errors.ServerSelectionTimeoutError: mongo:27017: [Errno -2] Name or service not known

Where is the data stored in the end? It did show some recipes to be crawled.

Thanks for your help!
