
Yelp Web Scraper

  • Writer: Atharva Anil Dastane
  • Nov 16, 2022
  • 1 min read

Updated: Nov 30, 2022

Introduction

What is Web Scraping?

Web scraping is an automatic method for obtaining large amounts of data from websites. Most of this data is unstructured HTML, which is then converted into structured data in a spreadsheet or a database so that it can be used in various applications.

What is Yelp?
Yelp's website, Yelp.com, is a crowd-sourced local business review and social networking site. The site has pages devoted to individual locations, such as restaurants or schools, where Yelp users can submit a review of their products or services using a one to five star rating scale. Businesses can also update contact information, hours, and other basic listing information or add special deals. In addition to writing reviews, users can react to reviews, plan events, or discuss their personal lives.

Problem Statement

Data is rarely easy to come by, and even when we can acquire it, it seldom arrives in a usable format; it is messy through and through. I wanted to look up some of the best restaurants in different metropolitan cities in the US on yelp.com, and to see what people say about those restaurants based on their experience. Instead of hopping back and forth between web pages, I wanted to collect all the restaurants and their reviews in one go. Here is my notebook, which gives you a detailed overview of scraping Yelp data into an Excel file.


Beautiful Soup

Beautiful Soup is a Python library for getting data out of HTML, XML, and other markup languages. Say you’ve found some webpages that display data relevant to your research, such as date or address information, but that do not provide any way of downloading the data directly. Beautiful Soup helps you pull particular content from a webpage, remove the HTML markup, and save the information. It is a tool for web scraping that helps you clean up and parse the documents you have pulled down from the web.
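A minimal sketch of what pulling content out of markup with Beautiful Soup looks like; the HTML snippet below is invented for illustration and is not Yelp's real markup:

```python
from bs4 import BeautifulSoup

# A toy snippet standing in for a downloaded page (invented markup,
# not Yelp's actual HTML).
html = ('<ul><li class="biz"><a href="/biz/cafe-x">Cafe X</a></li>'
        '<li class="biz"><a href="/biz/diner-y">Diner Y</a></li></ul>')

soup = BeautifulSoup(html, "html.parser")
# Strip the HTML markup and keep only the text we care about
names = [a.get_text() for a in soup.find_all("a")]
print(names)  # ['Cafe X', 'Diner Y']
```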


Importing necessary libraries

  • csv - to store all the restaurant and review data in a CSV file.

  • requests - module to send HTTP requests from Python.

  • re - module for regular expressions, if needed.

  • numpy, pandas - for basic data manipulation.
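The import cell can be sketched as follows; BeautifulSoup is included as well, since the parsing steps below depend on it:

```python
import csv       # write the scraped restaurants and reviews to CSV files
import re        # regular expressions for cleaning scraped text

import requests  # send HTTP requests to Yelp
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup  # parse the HTML returned by requests
```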


Creation of 10 URLs to scrape 100 best restaurants from a particular City
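A sketch of the URL construction, assuming Yelp's search results advance ten at a time via the `start` query parameter, so ten URLs cover the top 100 restaurants. The city here is a sample value; in practice the location should be URL-encoded (requests handles this when you pass it via `params`):

```python
# Ten search pages, ten results each -> top 100 restaurants for the city.
city = "Boston, MA"  # sample value; substitute any metropolitan area
base = "https://www.yelp.com/search?find_desc=Restaurants&find_loc="
urls = [f"{base}{city}&start={offset}" for offset in range(0, 100, 10)]
print(len(urls))  # 10
```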



Web Scraping Restaurants and their Reviews


Function to get the Soup Data

  • Function takes in the URL to be scraped.

  • We create dummy browser-like headers so that Yelp does not block repeated requests from the same IP address.

  • We send the HTTP request through requests.get(url) and store the result in response.
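The steps above can be sketched as follows; the header values are illustrative assumptions, not necessarily the exact ones used in the notebook:

```python
import requests
from bs4 import BeautifulSoup

# Browser-like headers so Yelp is less likely to reject repeated
# requests from one IP; the exact values are illustrative assumptions.
HEADERS = {
    "User-Agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/107.0.0.0 Safari/537.36"),
    "Accept-Language": "en-US,en;q=0.9",
}

def get_soup(url):
    """Fetch `url` and return the parsed BeautifulSoup document."""
    response = requests.get(url, headers=HEADERS)
    response.raise_for_status()  # fail loudly on 4xx/5xx
    return BeautifulSoup(response.text, "html.parser")
```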

Function to get the HTML tag and its class

  • Generalized function to get the HTML tag based on its class relevant to what we want.
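A sketch of such a generalized lookup; the function and parameter names are assumptions:

```python
from bs4 import BeautifulSoup

def get_html_tag_data(soup, tag, class_name):
    """Return every `tag` element carrying `class_name`.

    The scraper calls this with different tag/class pairs for
    restaurants and for reviews.
    """
    return soup.find_all(tag, class_=class_name)

# Usage on a toy document:
doc = BeautifulSoup('<li class="card">A</li><li class="card">B</li>',
                    "html.parser")
texts = [el.get_text() for el in get_html_tag_data(doc, "li", "card")]
print(texts)  # ['A', 'B']
```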

Main Function to be executed to generate the final CSV files

  • Main Function which generates two CSVs containing all the relevant restaurant data (1st CSV) and their reviews (2nd CSV).

  • Calls 3 methods - getHtmlTagData, getExtractedAttributeData, getMainReviewsData - which will be described below.

  • 'li' and ' border-color--default__09f24__NPAKY' are the base level HTML tags and class from where we will scrape all the information about the restaurants.
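A simplified skeleton of the collection loop, with the page fetcher injected (an assumption made for this sketch) so it can run against canned HTML instead of live Yelp pages:

```python
from bs4 import BeautifulSoup

# Base-level tag and class from which every restaurant card is scraped
BASE_TAG, BASE_CLASS = "li", "border-color--default__09f24__NPAKY"

def scrape_restaurants(urls, fetch):
    """Collect the restaurant cards from every search page.

    `fetch` is injected so the loop can be exercised without hitting
    Yelp; the real notebook downloads each page with requests instead.
    """
    cards = []
    for url in urls:
        soup = fetch(url)
        cards.extend(soup.find_all(BASE_TAG, class_=BASE_CLASS))
    return cards

# Exercise with a canned page instead of a live request:
fake_page = BeautifulSoup(
    '<li class="border-color--default__09f24__NPAKY">Cafe X</li>',
    "html.parser")
cards = scrape_restaurants(["page1", "page2"], lambda url: fake_page)
print(len(cards))  # 2
```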

Function to get Restaurants Attributes

  • Function to scrape name, rating, price, review count, cuisine of the restaurant.

  • find() - returns the first element matching the given tag.

  • find_all() - returns all elements matching the given tag.

  • findChildren() - Extracts a list of Tag objects that match the given criteria. You can specify the name of the Tag and any attributes you want the Tag to have.

  • The name of the restaurant is extracted from the anchor element via its 'href' attribute.

  • Individual lists have been created for name, price, cuisine, avg_rating, rating_count, review_count and all these were appended to a dictionary with proper keys.
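A sketch of the attribute extraction on a toy card; the class names and markup below are invented stand-ins for Yelp's obfuscated class names:

```python
from bs4 import BeautifulSoup

# Toy restaurant card (invented markup for illustration)
card = BeautifulSoup(
    '<li><a href="/biz/cafe-x">Cafe X</a>'
    '<span class="price">$$</span><span class="cuisine">Italian</span></li>',
    "html.parser")

info = {
    # the restaurant name lives on the anchor carrying the href attribute
    "name": card.find("a", href=True).get_text(),
    "price": card.find("span", class_="price").get_text(),
    "cuisine": card.find("span", class_="cuisine").get_text(),
}
print(info["name"])  # Cafe X
```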

Function to generate CSV for Restaurants Attribute/Info
  • Function writes all the restaurant info from the final dictionary to a CSV file.

  • DictWriter maps the final dictionary onto output rows.
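A minimal DictWriter sketch, writing to an in-memory buffer in place of the output file; the row values are sample data:

```python
import csv
import io

# Rows as assembled by the scraping step (sample values)
restaurants = [
    {"name": "Cafe X", "price": "$$", "avg_rating": "4.5"},
    {"name": "Diner Y", "price": "$", "avg_rating": "4.0"},
]

buf = io.StringIO()  # stands in for open("restaurants.csv", "w", newline="")
writer = csv.DictWriter(buf, fieldnames=["name", "price", "avg_rating"])
writer.writeheader()           # header row from the dictionary keys
writer.writerows(restaurants)  # one CSV row per dictionary
print(buf.getvalue().splitlines()[0])  # name,price,avg_rating
```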

Function to get the Reviews Data for every Restaurant

  • Main function to get the reviews of the restaurants.

  • 'h1' and 'css-1se8maq' are the base level HTML tag and class used to scrape name of the restaurant.

  • 'li' and ' margin-b5__09f24__pTvws border-color--default__09f24__NPAKY' are the base level HTML tag and class used to scrape the reviews.

Function to get the list of URLs containing Reviews

  • Function to create the review-page URLs for every restaurant and store them in rev_url_list.
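A sketch of building rev_url_list, assuming review pages paginate with a start offset like the search pages (the exact parameter name is an assumption):

```python
# Restaurant pages whose reviews we want (sample /biz/ URLs)
biz_urls = [
    "https://www.yelp.com/biz/cafe-x",
    "https://www.yelp.com/biz/diner-y",
]

rev_url_list = []
for biz in biz_urls:
    for offset in range(0, 30, 10):  # first three pages of ten reviews
        rev_url_list.append(f"{biz}?start={offset}")
print(len(rev_url_list))  # 6
```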

Function to get the Reviews and Ratings of the Customers

  • Function to scrape individual ratings, their review dates, and the reviews themselves.

  • get('aria-label') gives us the rating provided by the customer to the restaurant.

  • Individual lists are created for each attribute and appended to a final review dictionary.
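A sketch of reading the rating out of the aria-label, using an invented snippet of the star widget's markup:

```python
from bs4 import BeautifulSoup

# Star widget whose accessible label carries the numeric rating
# (illustrative snippet, not Yelp's exact markup)
review = BeautifulSoup(
    '<div class="rating" role="img" aria-label="4 star rating"></div>',
    "html.parser")

label = review.find("div", class_="rating").get("aria-label")
rating = float(label.split()[0])  # "4 star rating" -> 4.0
print(rating)  # 4.0
```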

Function to generate CSV for Reviews and Ratings

  • Function writes all the dictionary data containing the reviews, restaurant name, review date, and rating to a CSV.

Main Function Execution


Output


Top 20 Boston Restaurants Details and Reviews sample output.



Output files attached for Boston and Seattle



I hope you enjoyed the Yelp Web Scraping journey!!


Feel Free to Reach me!

  • Github
  • LinkedIn
  • Email
