scrapehero / yellowpages-scraper

Yellowpages.com Web Scraper written in Python and LXML to extract business details available based on a particular category and location.

Home Page: https://www.scrapehero.com/how-to-scrape-business-details-from-yellowpages-com-using-python-and-lxml/

Python 100.00%

business-directory yellow-pages scraper lxml web-scraper python yellow-pages-scraper html parsing extract

yellowpages-scraper's Introduction

Yellow Pages Business Details Scraper

To learn more about this scraper, see the blog post 'How to Scrape Business Details from Yellow Pages using Python and LXML' - https://www.scrapehero.com/how-to-scrape-business-details-from-yellowpages-com-using-python-and-lxml/

Getting Started

These instructions will get you a copy of the project up and running on your local machine for development and testing purposes.

Fields to Extract

This yellow pages scraper can extract the fields below:

Rank
Business Name
Phone Number
Business Page
Category
Website
Rating
Street name
Locality
Region
Zipcode
URL

Prerequisites

For this web scraping tutorial using Python 3, we will need some packages for downloading and parsing the HTML. Below are the package requirements:

lxml
requests

Installation

Use pip to install the following packages (https://pip.pypa.io/en/stable/installing/):

Python Requests, to make requests and download the HTML content of the pages (http://docs.python-requests.org/en/master/user/install/)

Python LXML, for parsing the HTML tree structure using XPaths (installation instructions: http://lxml.de/installation.html)

Running the scraper

Run the script by name, followed by the positional arguments keyword and place. For example, to find business details for restaurants in Boston, MA:

python3 yellow_pages.py restaurants Boston,MA
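As a rough illustration of how the two positional arguments could be handled, here is a minimal sketch that assumes the script parses them with argparse. The function names parse_args and build_search_url and the help strings are hypothetical; the URL format mirrors the one the script logs when it retrieves a page.

```python
import argparse


def parse_args(argv=None):
    """Parse the two positional arguments: keyword and place."""
    parser = argparse.ArgumentParser(
        description="Scrape business details from yellowpages.com"
    )
    parser.add_argument("keyword", help="search keyword, e.g. restaurants")
    parser.add_argument("place", help="place name, e.g. Boston,MA")
    return parser.parse_args(argv)


def build_search_url(keyword, place):
    """Build the search URL for a keyword and place (hypothetical helper)."""
    return (
        "https://www.yellowpages.com/search"
        "?search_terms={0}&geo_location_terms={1}".format(keyword, place)
    )


# Equivalent to: python3 yellow_pages.py restaurants Boston,MA
args = parse_args(["restaurants", "Boston,MA"])
print(build_search_url(args.keyword, args.place))
```

Passing an explicit list to parse_args keeps the sketch testable; calling it with no argument would read the real command line instead.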

Sample Output

This will create a CSV file containing the extracted fields.
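The CSV step can be sketched with the standard csv module. This is only an illustration under assumptions: the column keys in FIELDNAMES are hypothetical names derived from the field list above, and write_results is not taken from the script itself.

```python
import csv
import io

# Hypothetical column keys, one per field listed under "Fields to Extract".
FIELDNAMES = [
    "rank", "business_name", "telephone", "business_page", "category",
    "website", "rating", "street", "locality", "region", "zipcode",
    "listing_url",
]


def write_results(rows, fileobj):
    """Write a header line and one CSV row per scraped business."""
    writer = csv.DictWriter(fileobj, fieldnames=FIELDNAMES)
    writer.writeheader()
    for row in rows:
        # Keys missing from a row are written as empty cells.
        writer.writerow(row)


# Made-up sample data, written to an in-memory buffer for demonstration.
sample_rows = [
    {"rank": "1", "business_name": "Example Diner", "locality": "Boston"},
]
buffer = io.StringIO()
write_results(sample_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, open the target file with newline="" as the csv module documentation recommends, and pass the file object in place of the buffer.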

yellowpages-scraper's Issues

Was only able to get this to run once

I was only able to get this to run once via Google Colab. When I tried again the next day, I got an error saying the page failed to process.

!git clone https://github.com/scrapehero/yellowpages-scraper.git

!sudo apt-get install python3-lxml
!pip install requests lxml unicodecsv argparse

!ls

%cd yellowpages-scraper

!python3 yellow_pages.py restaurants Boston,MA

retrieving  https://www.yellowpages.com/search?search_terms=restaurants&geo_location_terms=Boston,MA
Failed to process page

Can email addresses be added to the extracted fields?

Running through a proxy and getting more than the first page of data

How do I get this to run through a proxy server?

I am also trying to make it loop so it can pull more than the first page.
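One possible approach to the proxy question, sketched under the assumption that the script downloads pages with the requests library: a requests.Session can carry a proxies mapping that applies to every request made through it. The proxy address below is a placeholder and make_session is a hypothetical helper, not part of the script.

```python
import requests


def make_session(proxy_url):
    """Return a requests session that routes HTTP and HTTPS through a proxy."""
    session = requests.Session()
    session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


# Placeholder proxy address; replace it with a real proxy server.
session = make_session("http://127.0.0.1:8080")

# The script's download call would then go through the session, for example:
# response = session.get(url, timeout=30)
```

The same mapping can also be passed per call through the proxies argument of requests.get if a session is not wanted.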

Recommended tool for importing to sql database

This is a fantastic script for what I need. Are there recommended tools for importing the CSV data into a SQL database? I'm on a Mac, so I'd welcome any recommendations, or could we build that functionality in?

I have a preference for PostgreSQL or MySQL and would be willing to help build it as an add-on function.
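As a starting point for such an add-on, here is a minimal sketch that loads a CSV into a database table using only the Python standard library. It uses SQLite as a stand-in because it ships with Python; it is not PostgreSQL or MySQL, and a driver for either of those would need its own connection setup and placeholder syntax. The table name businesses, the load_csv helper, and the sample columns are all hypothetical.

```python
import csv
import io
import sqlite3


def load_csv(conn, csv_file, table="businesses"):
    """Create a table from the CSV header and insert every row as text."""
    reader = csv.DictReader(csv_file)
    columns = reader.fieldnames or []
    if not columns:
        return 0
    # Quote identifiers so header names with spaces still work.
    quoted = ['"{}"'.format(c.replace('"', '""')) for c in columns]
    col_defs = ", ".join("{} TEXT".format(q) for q in quoted)
    conn.execute('CREATE TABLE IF NOT EXISTS "{}" ({})'.format(table, col_defs))
    placeholders = ", ".join("?" for _ in columns)
    rows = [[row[c] for c in columns] for row in reader]
    conn.executemany(
        'INSERT INTO "{}" ({}) VALUES ({})'.format(
            table, ", ".join(quoted), placeholders
        ),
        rows,
    )
    conn.commit()
    return len(rows)


# Made-up sample data standing in for the scraper's CSV output.
sample_csv = io.StringIO("rank,business_name\n1,Example Diner\n2,Sample Cafe\n")
conn = sqlite3.connect(":memory:")
inserted = load_csv(conn, sample_csv)
print(inserted, "rows inserted")
```

With a real output file, open it with newline="" and pass the file object in place of sample_csv.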

Scrape more than first page

Currently it only returns the first 30 hits. You can add a page attribute to get the results from a specific page, like this:

url = "https://www.yellowpages.com/search?search_terms={0}&geo_location_terms={1}&page={2}".format(keyword, place, page_count)

You can then do this in a while-loop, incrementing page_count and breaking when len(listings) == 0. This requires storing the results outside the loop and returning them after the loop.
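The loop described here could look roughly like the sketch below. It assumes that page numbering starts at 1 and that a callable is available which downloads one URL and returns its list of listings; fetch_listings, scrape_all_pages, and the max_pages safety cap are hypothetical additions, not part of the script.

```python
def build_url(keyword, place, page_count):
    """Build the search URL for one results page."""
    return (
        "https://www.yellowpages.com/search"
        "?search_terms={0}&geo_location_terms={1}&page={2}".format(
            keyword, place, page_count
        )
    )


def scrape_all_pages(keyword, place, fetch_listings, max_pages=50):
    """Collect listings page by page until a page comes back empty."""
    results = []  # stored outside the loop, returned after it
    page_count = 1  # assumption: the first page is page 1
    while page_count <= max_pages:
        listings = fetch_listings(build_url(keyword, place, page_count))
        if len(listings) == 0:
            break
        results.extend(listings)
        page_count += 1
    return results


# Fake fetcher for demonstration: two pages of results, then an empty page.
fake_pages = {"1": ["a", "b"], "2": ["c"]}


def fake_fetch(url):
    page = url.rsplit("page=", 1)[1]
    return fake_pages.get(page, [])


print(scrape_all_pages("restaurants", "Boston,MA", fake_fetch))
```

The max_pages cap is a guard against looping forever if the site keeps returning results for every page number; it can be raised or removed as needed.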
In addition, the phone number, street, and several other fields don't seem to be extracted. The XPath queries should look like this:

XPATH_TELEPHONE = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'phone')]//text()"
XPATH_STREET = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'street-address')]//text()"
XPATH_LOCALITY = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'locality')]//text()"
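To see how these queries behave, here is a small sketch that applies them with lxml to a hand-written HTML fragment. The fragment is an assumption made up for illustration and is not the live yellowpages.com markup; joined_text is a hypothetical helper.

```python
from lxml import html

XPATH_TELEPHONE = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'phone')]//text()"
XPATH_STREET = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'street-address')]//text()"
XPATH_LOCALITY = ".//div[@class='info']//div[contains(@class,'info-section info-secondary')]//div[contains(@class, 'locality')]//text()"

# Made-up listing markup, shaped only to match the queries above.
SAMPLE_LISTING = """
<div class="result">
  <div class="info">
    <div class="info-section info-secondary">
      <div class="phones phone primary">(617) 555-0100</div>
      <div class="adr">
        <div class="street-address">1 Example St</div>
        <div class="locality">Boston, MA 02101</div>
      </div>
    </div>
  </div>
</div>
"""


def joined_text(node, xpath):
    """Join the non-empty text nodes matched by an XPath, or return None."""
    parts = [text.strip() for text in node.xpath(xpath)]
    parts = [text for text in parts if text]
    return " ".join(parts) if parts else None


listing = html.fromstring(SAMPLE_LISTING.strip())
print(joined_text(listing, XPATH_TELEPHONE))
print(joined_text(listing, XPATH_STREET))
print(joined_text(listing, XPATH_LOCALITY))
```

Because the queries use contains() on the class attribute, they still match when the site adds extra class names to an element, which makes them somewhat more tolerant of markup changes than exact class matches.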
