
redfin-scraper

redfin-scraper is a proxy-based scraper that extracts properties from Redfin using filters. It is especially useful when you want to crawl all recently sold properties (e.g., properties sold in the past 3 years) in a given state or city.

Scraping Algorithm

Please refer to algorithm_sketch.md.

Prerequisites

  1. Have SQLite installed. If you are on a Mac, it is already installed and you can skip this step.
  2. Have Python 3.6 installed.
  3. Have a CSV file of proxies. You can buy proxies online or use a free service like proxybroker. The repo assumes proxies with user and password authorization. If your proxies do not need authorization, the CSV file can look like
ip,port
a.b.c.d,2345
e.f.g.h,1234
...

Otherwise, the CSV file should also carry the credentials:

ip,port,user,password
a.b.c.d,2345,user1,pass1
e.f.g.h,1234,user2,pass2
...
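
To make the two CSV layouts above concrete, here is a minimal sketch that reads either one and builds proxy URLs of the form scheme://user:password@ip:port. The helper name load_proxies and the http scheme are assumptions made for illustration; this is not the loader used by redfin_crawler.py.

```python
import csv
import io


def load_proxies(csv_file):
    """Read proxy rows from an open CSV file and return proxy URLs.

    Hypothetical helper, not part of redfin-scraper. Handles both the
    ip,port layout and the ip,port,user,password layout.
    """
    urls = []
    for row in csv.DictReader(csv_file):
        host = "{}:{}".format(row["ip"].strip(), row["port"].strip())
        # The user and password columns are missing in the two-column layout.
        user = (row.get("user") or "").strip()
        password = (row.get("password") or "").strip()
        if user and password:
            urls.append("http://{}:{}@{}".format(user, password, host))
        else:
            urls.append("http://{}".format(host))
    return urls


# Made-up sample data; a real run would pass open("proxy.csv", newline="").
sample = io.StringIO(
    "ip,port,user,password\n"
    "10.0.0.1,2345,user1,pass1\n"
    "10.0.0.2,1234,,\n"
)
proxy_urls = load_proxies(sample)
print(proxy_urls)
```

Using csv.DictReader keys each value by the header row, so the same function works whether or not the credential columns are present.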

Environment Setup

  1. Create a Python virtual environment with python3.
python3.6 -m venv /path/to/venv
  2. Activate the venv.
source /path/to/venv/bin/activate
  3. Install the dependencies.
pip install -r requirements.txt

How to use

Once the prerequisites are in place and the Python environment is set up, you can scrape Redfin data based on your needs. Below I demonstrate redfin-scraper usage by scraping a small city called Belmont (https://www.redfin.com/city/1362/CA/Belmont).

Property Summary URLs Only

If you want to get all Redfin summary URLs in a given city, run

python redfin_crawler.py proxy.csv https://www.redfin.com/city/1362/CA/Belmont \
--property_prefix https://www.redfin.com/city/1362/CA/Belmont --type pages

Scraping Property Details

If you need the property details, run with --type properties. This not only generates the summary URLs that contain the properties but also extracts the property metadata from those URLs.

python redfin_crawler.py good_proxies.csv https://www.redfin.com/city/1362/CA/Belmont \
--property_prefix https://www.redfin.com/city/1362/CA/Belmont --type properties

Known Issues and Bugs

Fork safety issue on Mac

If you are on a Mac and see errors like

may have been in progress in another thread when fork() was called.
We cannot safely call it or ignore it in the fork() child process. Crashing instead

try setting the following environment variable before running the program

export OBJC_DISABLE_INITIALIZE_FORK_SAFETY=YES

Scraping with proxies returns a 403 error code

Most likely the proxy has been blocked by the website's detection algorithm. You can temporarily remove that proxy from your proxy pool.
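
One way to picture "temporarily removing" a blocked proxy is the sketch below. The ProxyPool class and its method names are hypothetical and exist only for illustration; the repo may manage its pool differently.

```python
import random


class ProxyPool:
    """Hypothetical in-memory proxy pool, not the repo's implementation."""

    def __init__(self, proxies):
        self.active = list(proxies)
        self.blocked = []

    def pick(self):
        """Return a random active proxy, or None if the pool is empty."""
        return random.choice(self.active) if self.active else None

    def report(self, proxy, status_code):
        """Move a proxy to the blocked list when it returns HTTP 403."""
        if status_code == 403 and proxy in self.active:
            self.active.remove(proxy)
            self.blocked.append(proxy)


# Made-up proxy addresses for the demonstration.
pool = ProxyPool(["http://10.0.0.1:2345", "http://10.0.0.2:1234"])
pool.report("http://10.0.0.1:2345", 403)  # blocked, so it is set aside
pool.report("http://10.0.0.2:1234", 200)  # still healthy, stays active
print(pool.active, pool.blocked)
```

Keeping blocked proxies in a separate list rather than deleting them makes it easy to retry them later, since a block is often temporary.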

But how do I know whether a proxy is good or not?

I put a proxy_checker.py script in the tools directory. You can use it to eliminate the proxies that are currently blocked by the external website. To use it, run

python tools/proxy_checker.py --proxy_csv_path proxy.csv
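
For readers curious what such a check might involve, here is a rough sketch using only the standard library. It is not the code in tools/proxy_checker.py; the function names probe_status and filter_good_proxies, the 10-second timeout, and the rule "keep only proxies that return HTTP 200" are all assumptions.

```python
import urllib.error
import urllib.request


def probe_status(proxy_url, target_url, timeout=10):
    """Fetch target_url through proxy_url and return the HTTP status code.

    Returns None when the request fails without an HTTP response.
    """
    handler = urllib.request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    opener = urllib.request.build_opener(handler)
    try:
        with opener.open(target_url, timeout=timeout) as response:
            return response.getcode()
    except urllib.error.HTTPError as err:
        # Error statuses such as 403 arrive as exceptions carrying the code.
        return err.code
    except (urllib.error.URLError, OSError):
        return None


def filter_good_proxies(proxy_urls, target_url, probe=probe_status):
    """Keep only the proxies whose probe returns HTTP 200."""
    return [p for p in proxy_urls if probe(p, target_url) == 200]


# Offline demonstration with a stubbed probe, so no network access is needed.
fake_results = {"http://10.0.0.1:2345": 403, "http://10.0.0.2:1234": 200}
good = filter_good_proxies(
    list(fake_results),
    "https://www.redfin.com/city/1362/CA/Belmont",
    probe=lambda proxy, url: fake_results[proxy],
)
print(good)
```

Passing the probe as a parameter keeps the filtering logic testable without touching the network; in a real run the default probe_status would be used.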

Disclaimer

Scraping websites can violate their terms of service. Use at your own risk.

TODO

  1. Add free proxy integration so no external proxy file is needed.
  2. Make it a package so users can easily install it with pip.
  3. Add Docker environment.

Contributors

wang-ye
