
scrapehero / yellowpages-scraper


Yellowpages.com Web Scraper written in Python and LXML to extract business details available based on a particular category and location.

Home Page: https://www.scrapehero.com/how-to-scrape-business-details-from-yellowpages-com-using-python-and-lxml/

Python 100.00%
business-directory yellow-pages scraper lxml web-scraper python yellow-pages-scraper html parsing extract

yellowpages-scraper's Introduction

Yellow Pages Business Details Scraper


To learn more about this scraper, see the blog post 'How to Scrape Business Details from Yellow Pages using Python and LXML': https://www.scrapehero.com/how-to-scrape-business-details-from-yellowpages-com-using-python-and-lxml/

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Fields to Extract

This yellow pages scraper can extract the fields below:

  1. Rank
  2. Business Name
  3. Phone Number
  4. Business Page
  5. Category
  6. Website
  7. Rating
  8. Street name
  9. Locality
  10. Region
  11. Zipcode
  12. URL

Prerequisites

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements:

  • lxml
  • requests

Installation

Use pip to install the following packages (https://pip.pypa.io/en/stable/installing/):

Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)

Python LXML, for parsing the HTML tree structure using XPaths (installation instructions: http://lxml.de/installation.html)

Running the scraper

Run the script by name, followed by the positional arguments keyword and place. For example, to find business details for restaurants in Boston, MA:

python3 yellow_pages.py restaurants Boston,MA
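As a rough illustration of how the two positional arguments might be read, here is a minimal sketch using the standard library's argparse. The function name parse_args and the help strings are assumptions made for this example; the actual script may structure its argument handling differently.

```python
import argparse


def parse_args(argv=None):
    """Parse two positional arguments: the search keyword and the place.

    parse_args is a hypothetical helper, not necessarily what the
    real yellow_pages.py defines.
    """
    parser = argparse.ArgumentParser(
        description="Illustrative argument parser for a Yellow Pages scraper"
    )
    parser.add_argument("keyword", help="search keyword, e.g. restaurants")
    parser.add_argument("place", help="place name, e.g. Boston,MA")
    return parser.parse_args(argv)


# Passing a list stands in for the command line shown above.
args = parse_args(["restaurants", "Boston,MA"])
print(args.keyword, args.place)
```

Note that the place is written without a space (Boston,MA) so that the shell passes it as a single argument.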

Sample Output

This will create a CSV file with the scraped business details.
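To give a feel for what such a file could look like, the sketch below writes listing dictionaries to CSV with the standard csv module. The column names in FIELDNAMES are hypothetical labels chosen to mirror the "Fields to Extract" list, and write_listings is an illustrative helper; the real script may name its columns differently and use a different CSV library.

```python
import csv
import io

# Hypothetical column names, one per field in "Fields to Extract".
FIELDNAMES = [
    "rank", "business_name", "telephone", "business_page",
    "category", "website", "rating", "street",
    "locality", "region", "zipcode", "listing_url",
]


def write_listings(listings, handle):
    """Write a header row plus one quoted row per listing dictionary."""
    writer = csv.DictWriter(handle, fieldnames=FIELDNAMES, quoting=csv.QUOTE_ALL)
    writer.writeheader()
    for row in listings:
        # Keys missing from a row are written as empty strings.
        writer.writerow(row)


# An in-memory buffer stands in for the output file; the row is made up.
buffer = io.StringIO()
write_listings([{"rank": 1, "business_name": "Example Diner"}], buffer)
print(buffer.getvalue())
```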

yellowpages-scraper's People

Contributors

scrapehero-code


yellowpages-scraper's Issues

Was only able to get this to run once..

I was only able to get this to run once via Google Colab. When I tried again the next day, I got an error that the page failed to process.

!git clone https://github.com/scrapehero/yellowpages-scraper.git

!sudo apt-get install python3-lxml
!pip install requests lxml unicodecsv argparse

!ls

%cd yellowpages-scraper

!python3 yellow_pages.py restaurants Boston,MA

retrieving  https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston,MA
Failed to process page

Recommended tool for importing to sql database

This is a fantastic script for what I need, and I was curious whether there are recommended tools for importing the CSV data into a SQL database. I'm on a Mac, so I was wondering if there are recommendations, or if we could build that functionality in?

I have a preference for PostgreSQL or MySQL and would be willing to help build it as an add-on function.
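One possible starting point for such an add-on is sketched below. It uses SQLite from the Python standard library as a stand-in, not PostgreSQL or MySQL, so that it runs without a database server. The helper import_csv, the table name listings, and the sample columns are all assumptions; every column is stored as TEXT for simplicity. Porting it to PostgreSQL or MySQL would require a separate driver and that driver's placeholder style.

```python
import csv
import io
import sqlite3


def import_csv(conn, handle, table="listings"):
    """Load a CSV with a header row into a table, one TEXT column per header.

    Table and column names come straight from the CSV header, so this
    sketch should only be used with trusted input.
    """
    reader = csv.DictReader(handle)
    columns = reader.fieldnames
    column_defs = ", ".join(f'"{name}" TEXT' for name in columns)
    conn.execute(f'CREATE TABLE IF NOT EXISTS "{table}" ({column_defs})')
    placeholders = ", ".join("?" for _ in columns)
    rows = [tuple(row[name] for name in columns) for row in reader]
    conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', rows)
    conn.commit()
    return len(rows)


# Made-up sample data standing in for the scraper's CSV output.
sample = io.StringIO("business_name,telephone\nExample Diner,555-0100\n")
conn = sqlite3.connect(":memory:")
inserted = import_csv(conn, sample)
print(inserted, conn.execute('SELECT * FROM "listings"').fetchall())
```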

Scrape more than first page

Currently it only returns the first 30 hits. You can add a page attribute to get the results from a specific page, like this:

url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}&page={2}".format(keyword, place, page_count)

You can then do this in a while loop, incrementing page_count and breaking when len(listings) == 0. This requires storing the results outside the loop and returning them after the loop.
In addition, the phone number, street, and some other fields don't seem to be extracted correctly. The XPath queries should look like this:

XPATH_TELEPHONE = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'phone')]//text()"
XPATH_STREET = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'street-address')]//text()"
XPATH_LOCALITY = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'locality')]//text()"
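The pagination loop described in this issue could be sketched roughly as follows. The function scrape_all_pages, the fetch_listings callback, and the max_pages safety cap are assumptions introduced for illustration; in the real script, the callback would download the URL and parse the listings out of the page. A fake callback is used here so the sketch runs without network access.

```python
URL_TEMPLATE = (
    "https://www.yellowpages.com/search"
    "?search_terms={0}&geo_location_terms={1}&page={2}"
)


def scrape_all_pages(keyword, place, fetch_listings, max_pages=50):
    """Collect listings page by page until a page comes back empty.

    fetch_listings is a hypothetical callable that takes a URL and
    returns a list of listings; max_pages is an assumed safety cap.
    """
    results = []  # stored outside the loop, returned after it
    page_count = 1
    while page_count <= max_pages:
        url = URL_TEMPLATE.format(keyword, place, page_count)
        listings = fetch_listings(url)
        if len(listings) == 0:
            break
        results.extend(listings)
        page_count += 1
    return results


def fake_fetch(url):
    """Stand-in for a real download: pages 1 to 3 hold two listings each."""
    page = int(url.rsplit("page=", 1)[1])
    if page > 3:
        return []
    return [f"listing-{page}-{i}" for i in range(2)]


all_listings = scrape_all_pages("restaurants", "Boston,MA", fake_fetch)
print(len(all_listings))
```

Passing the fetch step in as a callback keeps the loop logic separate from the download and parsing code, which makes it easier to test.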
