Yelp Web Scraper
- Atharva Anil Dastane
- Nov 16, 2022
Updated: Nov 30, 2022
Introduction
What is Web Scraping?
Web scraping is an automated method for obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.
What is Yelp?
Yelp's website, Yelp.com, is a crowd-sourced local business review and social networking site. The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review of their products or services using a one to five star rating scale. Businesses can also update contact information, hours, and other basic listing information or add special deals. In addition to writing reviews, users can react to reviews, plan events, or discuss their personal lives.
Problem Statement
Data is rarely available in a ready-to-use form. Even when we manage to acquire it, it seldom comes in a clean format and is usually messy. I wanted to look at some of the best restaurants in different US metropolitan cities on yelp.com, along with what people say about those restaurants based on their experience. Instead of clicking back and forth between web pages, I wanted to collect all the restaurants and their reviews in one go. Here is my notebook, which gives a detailed overview of scraping Yelp data into an Excel file.
Beautiful Soup
Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Say you’ve found some webpages that display data relevant to your research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.
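To make the description above concrete, here is a minimal sketch of pulling two fields out of a small piece of HTML with Beautiful Soup. The HTML snippet, tag names, and class names are made up for illustration; real Yelp markup is different and changes often.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML snippet (not real Yelp markup).
html = """
<div class="listing">
  <h3 class="name">Sample Diner</h3>
  <span class="address">1 Main St, Boston, MA</span>
</div>
"""

# Parse the document with Python's built-in HTML parser.
soup = BeautifulSoup(html, "html.parser")

# Find each element by tag and class, then strip the markup to get plain text.
name = soup.find("h3", class_="name").get_text(strip=True)
address = soup.find("span", class_="address").get_text(strip=True)

print(name)
print(address)
```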
Importing necessary libraries

csv - to store all the restaurant and review data in a CSV file.
requests - module to send HTTP requests using Python.
re - module for regular expressions, if needed.
numpy, pandas - for basic data manipulation.
Creation of 10 URLs to scrape the 100 best restaurants from a particular city
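A minimal sketch of how the ten search-page URLs might be built, assuming each results page lists 10 restaurants and that pages are selected with a start offset. The base URL, the query parameter names (find_desc, find_loc, start), and the function name build_search_urls are assumptions for illustration, not taken from the original notebook.

```python
from urllib.parse import urlencode

# Assumed search endpoint; verify against the live site before relying on it.
BASE_URL = "https://www.yelp.com/search"


def build_search_urls(city, pages=10, per_page=10):
    """Return one search URL per results page for the given city."""
    urls = []
    for page in range(pages):
        query = urlencode({
            "find_desc": "Restaurants",   # assumed parameter name
            "find_loc": city,             # assumed parameter name
            "start": page * per_page,     # offset of the first result on the page
        })
        urls.append(f"{BASE_URL}?{query}")
    return urls


boston_urls = build_search_urls("Boston, MA")
for url in boston_urls:
    print(url)
```

With the defaults, the offsets run 0, 10, ..., 90, which covers 100 results across 10 pages.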

Web Scraping Restaurants and Their Reviews
Function to get the Soup Data

The function takes in the URL to be scraped.
We create dummy, browser-like headers so that Yelp is less likely to block repeated requests coming from the same IP address.
We send the HTTP request through requests.get(url) and store the result in response.
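The steps above can be sketched roughly as follows. The header values, the timeout, and the function name get_soup are assumptions, since the original notebook code is not shown here; treat this as one possible shape rather than the exact implementation.

```python
import requests
from bs4 import BeautifulSoup

# Assumed browser-like headers; the notebook's actual values are not shown.
HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/107.0 Safari/537.36"
    ),
    "Accept-Language": "en-US,en;q=0.9",
}


def get_soup(url, timeout=10):
    """Fetch a page and return its parsed HTML."""
    # Send the HTTP request with the dummy headers.
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    # Raise an exception on 4xx/5xx instead of parsing an error page.
    response.raise_for_status()
    # Parse the HTML body with Python's built-in parser.
    return BeautifulSoup(response.text, "html.parser")
```

Calling get_soup(url) for each page URL returns a soup object that can then be searched by tag and class. Headers alone do not guarantee access, so it is worth checking the site's terms of service and pausing between requests.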
Function to get the HTML tag and its class
Main Function to be executed to generate the final Excel files
Function to get Restaurant Attributes
Function to generate CSV for Restaurant Attributes/Info
Function to get the Reviews Data for every Restaurant
Function to get the list of URLs containing Reviews
Function to get the Reviews and Ratings of the Customers
Function to generate CSV for Reviews and Ratings
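A minimal sketch of writing scraped reviews and ratings to CSV with the csv module. The column names, the sample rows, and the function name write_reviews_csv are hypothetical; the original notebook may organize its output differently.

```python
import csv
import io


def write_reviews_csv(rows, file_obj):
    """Write review dictionaries to an open text file as CSV."""
    fieldnames = ["restaurant", "rating", "review"]  # assumed column names
    writer = csv.DictWriter(file_obj, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


# Made-up sample data standing in for scraped reviews.
sample_rows = [
    {"restaurant": "Sample Diner", "rating": 5, "review": "Great food, friendly staff."},
    {"restaurant": "Sample Diner", "rating": 3, "review": "Decent, but slow service."},
]

# Write to an in-memory buffer here; a real run would use an open file.
buffer = io.StringIO()
write_reviews_csv(sample_rows, buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with open("reviews.csv", "w", newline="", encoding="utf-8"); the resulting file opens directly in Excel.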
Main Function Execution
Output
Top 20 Boston Restaurants Details and Reviews sample output.


Output files attached for Boston and Seattle
I hope you enjoyed the Yelp Web Scraping journey!!