Fotocasa

Description

Scrapes property details from https://fotocasa.es/ and stores them in a PostgreSQL database.

Implementations

  1. Web scraping of property details from the Fotocasa website.
  2. Rotating proxies to bypass the anti-bot mechanism of the source website.
  3. Scrapy-Splash implementation for JavaScript content such as infinite scrolling.
  4. Model/pipeline design and development for the PostgreSQL database.
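As a rough illustration of the pipeline idea in item 4, the sketch below follows the method names Scrapy calls on an item pipeline (`open_spider`, `process_item`, `close_spider`). It is not the project's `PostgresDBPipeline`: the class name, the `properties` table, and the `url`/`title`/`price` fields are hypothetical, and the standard-library `sqlite3` module stands in for PostgreSQL so the example runs without a database server.

```python
import sqlite3


class SqlitePropertyPipeline:
    """Hypothetical stand-in pipeline; sqlite3 replaces PostgreSQL here."""

    def __init__(self, path=":memory:"):
        self.path = path
        self.conn = None

    def open_spider(self, spider):
        # Runs once when the spider starts: connect and create the table.
        self.conn = sqlite3.connect(self.path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS properties ("
            "url TEXT PRIMARY KEY, title TEXT, price INTEGER)"
        )

    def process_item(self, item, spider):
        # Runs for every scraped item: replace any existing row with the
        # same URL so that re-crawling a listing does not duplicate it.
        self.conn.execute(
            "INSERT OR REPLACE INTO properties (url, title, price) "
            "VALUES (?, ?, ?)",
            (item["url"], item.get("title"), item.get("price")),
        )
        self.conn.commit()
        return item  # hand the item on to any later pipeline

    def close_spider(self, spider):
        # Runs once when the spider finishes.
        self.conn.close()
```

Keying the table on the listing URL is one possible way to keep repeated crawls idempotent; the real project may use a different key or schema.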

Setup Environment Variables

In settings.py add the following configuration:

  1. Database connection

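    import os

    # Connection settings; each value can be overridden by an environment variable.
    DATABASE = {
        # SQLAlchemy 1.4+ no longer accepts the 'postgres' alias
        'drivername': 'postgresql',
        'host': os.environ.get('DB_HOST', 'localhost'),
        'port': os.environ.get('DB_PORT', '5432'),
        'username': os.environ.get('DB_USERNAME', 'user'),
        'password': os.environ.get('DB_PASSWORD', 'pwd'),
        'database': os.environ.get('DB_NAME', 'mydb')
    }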
    
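  2. Database pipeline and proxy middlewares

    ITEM_PIPELINES = {
        'fotocasa.pipelines.PostgresDBPipeline': 330,
    }

    # The rotating-proxy classes are downloader middlewares, not item pipelines.
    DOWNLOADER_MIDDLEWARES = {
        'rotating_proxies.middlewares.RotatingProxyMiddleware': 610,
        'rotating_proxies.middlewares.BanDetectionMiddleware': 620,
    }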
    
  3. Proxies path

    ROTATING_PROXY_LIST_PATH = 'fotocasa/proxies.txt'
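To show how the database settings might be consumed, here is a minimal sketch that turns a dictionary shaped like `DATABASE` into a connection URL string of the form SQLAlchemy's `create_engine` accepts. The `database_url` helper is hypothetical and uses only the standard library; the project itself may build its engine differently.

```python
import os
from urllib.parse import quote_plus

# Same shape as the DATABASE setting, with environment-variable overrides.
DATABASE = {
    'drivername': 'postgresql',
    'host': os.environ.get('DB_HOST', 'localhost'),
    'port': os.environ.get('DB_PORT', '5432'),
    'username': os.environ.get('DB_USERNAME', 'user'),
    'password': os.environ.get('DB_PASSWORD', 'pwd'),
    'database': os.environ.get('DB_NAME', 'mydb'),
}


def database_url(cfg):
    """Build a 'driver://user:password@host:port/dbname' URL (hypothetical helper)."""
    # Percent-encode the credentials so characters such as '@' or ':'
    # in a password do not break the URL.
    user = quote_plus(cfg['username'])
    password = quote_plus(cfg['password'])
    return (
        f"{cfg['drivername']}://{user}:{password}"
        f"@{cfg['host']}:{cfg['port']}/{cfg['database']}"
    )


if __name__ == '__main__':
    print(database_url(DATABASE))
```

Reading the values from the environment keeps credentials out of version control; the defaults are only suitable for local development.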

Dependencies

  1. Install the following dependencies from requirements.txt with pip install -r requirements.txt:

    sqlalchemy
    psycopg2
    scrapy-rotating-proxies
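After installing, a quick check along these lines can confirm that the packages are importable. The `missing_modules` helper is hypothetical; the import names assumed here are `sqlalchemy`, `psycopg2`, and `rotating_proxies` (the module name used in the settings paths for scrapy-rotating-proxies).

```python
from importlib.util import find_spec

# Import names assumed for the three listed packages.
REQUIRED_MODULES = ['sqlalchemy', 'psycopg2', 'rotating_proxies']


def missing_modules(names):
    """Return the top-level module names that cannot be found."""
    return [name for name in names if find_spec(name) is None]


if __name__ == '__main__':
    missing = missing_modules(REQUIRED_MODULES)
    if missing:
        print('Missing modules:', ', '.join(missing))
    else:
        print('All dependencies are importable.')
```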
    

Create egg file

  1. Create a setup.py file at the same level as the scrapy.cfg file with the following content:

    from setuptools import setup, find_packages

    setup(
        name='fotocasa',
        version='1.0',
        packages=find_packages(),
        # Each requirement needs its own comma-separated entry.
        install_requires=[
            'psycopg2',
            'sqlalchemy',
            'scrapy-rotating-proxies'
        ],
        # Tells scrapyd which settings module the project uses.
        entry_points={'scrapy': ['settings = fotocasa.settings']}
    )
    
  2. Run python setup.py bdist_egg in the folder that contains the scrapy.cfg file.

  3. Upload the egg file to the scrapyd server using:

    curl http://localhost:6800/addversion.json -F project=fotocasa -F version=1.0 -F egg=@dist/fotocasa-1.0-py3.7.egg
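The py3.7 suffix in the egg filename depends on the Python version that ran bdist_egg, so the path passed to curl may differ on other interpreters. The sketch below, with a hypothetical `expected_egg_path` helper, shows how the path could be derived; it assumes a pure-Python package whose name needs no normalization, as is the case for `fotocasa`.

```python
import sys


def expected_egg_path(name, version, python_version=None):
    """Guess the path bdist_egg writes for a pure-Python package."""
    if python_version is None:
        # Default to the interpreter running this script.
        python_version = sys.version_info[:2]
    major, minor = python_version
    return f"dist/{name}-{version}-py{major}.{minor}.egg"


if __name__ == '__main__':
    # Path to use after "egg=@" in the curl upload command.
    print(expected_egg_path('fotocasa', '1.0'))
```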

Contributors

codewithpatch
