
How to Scrape Yelp Data: Tutorial

Enrika Pavlovskytė

2023-08-24 · 5 min read

Yelp is not just a place to find your next dinner spot. With years of compiled crowd-sourced content, Yelp is the perfect place to look for business data that can provide insights into local economic trends.

In this article, we'll explore why businesses scrape Yelp data and what the benefits are. We'll also show you a solution that can efficiently extract Yelp data at any scale.

Why scrape Yelp?

As mentioned above, throughout years of operating, Yelp has created a unique dataset filled with business details that can be scraped and used for various purposes – from journalism and academic research to business operations. On the business side, you might want to consider Yelp data for such things as:

  • Customer sentiment analysis

  • Market research 

  • Competitor analysis

  • Location planning 

  • Reputation management

What data can be scraped from Yelp?

There is a variety of business data that can be gathered from Yelp, but you first need to be familiar with the types of pages that you will be scraping – the search page and the business page.

The search page will provide you with an overview of all the local businesses that fit your search criteria. This means that you will be able to collect such data as:

  • Business name

  • URL

  • Reviews count

  • Rating

  • Tags

The business page, on the other hand, offers more detailed information about one specific business. This means that you get the information from the search page plus a more in-depth look at such things as reviews. In a nutshell, you can scrape:

  • Business name

  • URL

  • Extended review information

  • Contact information

  • Opening times

  • Amenities information 

So, Yelp offers a variety of data that can be used to uncover excellent business opportunities.

Project setup

For this tutorial, you'll be using Python, so if you don't have it already, download the latest version from the official website.

1. Installing dependencies

Next, you’ll have to install some libraries. All these libraries are available in the Python Package Index, so you can install them using a single command given below:

pip install requests beautifulsoup4 pandas

2. Importing libraries

Now, import all the freshly installed libraries:

from bs4 import BeautifulSoup
import pandas as pd
import requests

As you can see, all three libraries are imported and ready to use. The requests module sends network requests. Once the server responds, the BeautifulSoup module parses the HTML content from the response object, and the pandas library converts the parsed data into a CSV file.

3. Setting up Web Scraper API

To make things easier, we'll use Oxylabs' Web Scraper API, which allows users to extract data from any website. Its main advantages are a built-in proxy rotator, custom device types, and JavaScript rendering. This means there is a significantly lower chance that your scraping operations will encounter IP blocks or CAPTCHAs.

Let's quickly look at the various parameters available to you.

  • source: Data source. For Yelp, it should be set to `universal`. This parameter is required.

  • url: Yelp or any other website URL. This parameter is also required.

  • user_agent_type: Configures the device type and browser.

  • geo_location: Applies a proxy based on the specified geolocation.

  • locale: Configures the Accept-Language header.

  • render: Enables JavaScript-based rendering.

  • callback_url: URL of your callback endpoint (if any).

  • parse: If set to `true`, returns structured data using the given `parsing_instructions`.

  • parsing_instructions: Defines custom parsing and data transformation logic to be executed on the HTML.

  • context:headers: Customizes request headers.

  • context:http_method: Customizes the HTTP method, e.g., `POST`.

  • context:session_id: Keeps the same proxy across multiple requests for 10 minutes.

  • context:cookies: Allows custom cookies.

Also, check out the complete list of parameters in the documentation.

To use the Oxylabs Web Scraper API, you’ll need an Oxylabs account. Use your API user credentials and prepare a payload. The code will be similar to the one below:

page = "https://www.yelp.com/biz/memento-sf-san-francisco-3"
payload = {
    "source": "universal",
    "render": "html",
    "user_agent_type": "desktop",
    "url": page,
}
credentials = ("USERNAME", "PASSWORD")
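The payload above sticks to the essentials. Optional parameters from the table can be added alongside them; in this sketch, the `geo_location` and `locale` values are illustrative assumptions, so check the documentation for the accepted formats:

```python
payload = {
    "source": "universal",            # required for Yelp
    "url": "https://www.yelp.com/biz/memento-sf-san-francisco-3",
    "render": "html",                 # enable JavaScript rendering
    "user_agent_type": "desktop",     # device/browser profile
    "geo_location": "United States",  # proxy location (assumed format)
    "locale": "en-us",                # sets the Accept-Language header
}
```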

4. Scraping Yelp business page

Once the payload is ready with the page URL, send a POST request to the Web Scraper API. Don't forget to pass the authentication credentials.

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

If everything works as expected, you should see the response code 200. If you get any other response code, please refer to the documentation.
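If you plan to request several pages, it can help to wrap the call and the status check into a small helper. This is just a sketch: the function name and the timeout value are our own choices, and `raise_for_status()` raises `requests.HTTPError` on any non-2xx code, so failures surface immediately instead of during parsing:

```python
import requests

def fetch_html(payload, credentials):
    """Send a job to the Web Scraper API and return the rendered HTML.

    Raises requests.HTTPError on a non-2xx response.
    """
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=credentials,
        json=payload,
        timeout=180,  # rendered pages can take a while
    )
    response.raise_for_status()
    return response.json()["results"][0]["content"]
```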

Inspecting elements

Before you start parsing the business page, let's open it in a web browser and use the developer tools to identify the necessary CSS selectors. Right-click and select Inspect, or press CTRL + SHIFT + I (Windows) or ⌥ + ⌘ + I (macOS) to open the developer tools.

Next, let’s find the CSS selectors for the name, reviews, rating, working hours and location elements.

As you can see, the name is available in an <h1> tag. Similarly, the rating is available in a span element with the class css-1fdy0l5.

The review count is available in an <a> tag with the class css-19v1rkv.

The address is wrapped in an <address> tag.

Last but not least, the working hours are wrapped in a <table> element.

Parsing data

Now using all this information, you can start writing the parser with Beautiful Soup. The Web Scraper API returns a JSON response in which the HTML code is available in the content property. 

For both location and working_hours, you’ll have to extract the text of all the child elements of the parent element. Fortunately, Beautiful Soup has a get_text() method which you can use for such cases.

data = []

soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
name = soup.find("h1").text
rating = soup.find("span", class_="css-1fdy0l5").text
review = soup.find("a", class_="css-19v1rkv").text
location = soup.find("address").get_text(strip=True)
working_hours = soup.find("table").get_text(strip=True)
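To make the behavior of `get_text()` concrete, here is a tiny standalone example on nested elements (the HTML snippet is made up). Without a separator, the texts of adjacent cells run together; passing `separator=" "` keeps fields like working hours readable:

```python
from bs4 import BeautifulSoup

# A made-up working-hours table to illustrate get_text() behavior.
html = "<table><tr><td>Mon</td><td>9:00 AM - 5:00 PM</td></tr></table>"
table = BeautifulSoup(html, "html.parser").find("table")

print(table.get_text(strip=True))                 # Mon9:00 AM - 5:00 PM
print(table.get_text(separator=" ", strip=True))  # Mon 9:00 AM - 5:00 PM
```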

Saving data to a CSV file

After parsing the data, you can save everything in a structured CSV file by appending the parsed data to a data list and then using pandas to export the list into CSV:

data.append({
    "name": name,
    "rating": rating,
    "review": review,
    "location": location,
    "working hours": working_hours,
})

df = pd.DataFrame(data)
df.to_csv("yelp_business_data.csv", index=False)

Full source code

from bs4 import BeautifulSoup
import pandas as pd
import requests

page = "https://www.yelp.com/biz/memento-sf-san-francisco-3"
payload = {
    "source": "universal",
    "render": "html",
    "user_agent_type": "desktop",
    "url": page,
}
credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

data = []

soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
name = soup.find("h1").text
rating = soup.find("span", class_="css-1fdy0l5").text
review = soup.find("a", class_="css-19v1rkv").text
location = soup.find("address").get_text(strip=True)
working_hours = soup.find("table").get_text(strip=True)

data.append({
    "name": name,
    "rating": rating,
    "review": review,
    "location": location,
    "working hours": working_hours,
})

df = pd.DataFrame(data)
df.to_csv("yelp_business_data.csv", index=False)

And that’s it! You’ve successfully extracted the content of a Yelp business page.
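If you need more than one business page, the same flow can be wrapped into a function and run over a list of URLs. This is a sketch under the same assumptions as above: the function name is ours, and the selectors are the ones used earlier, which may break whenever Yelp changes its markup:

```python
import requests
from bs4 import BeautifulSoup

def scrape_business(url, credentials):
    """Fetch one Yelp business page via the Web Scraper API and parse it."""
    payload = {
        "source": "universal",
        "render": "html",
        "user_agent_type": "desktop",
        "url": url,
    }
    response = requests.post(
        "https://realtime.oxylabs.io/v1/queries",
        auth=credentials,
        json=payload,
    )
    response.raise_for_status()
    soup = BeautifulSoup(response.json()["results"][0]["content"], "html.parser")
    return {
        "name": soup.find("h1").text,
        "rating": soup.find("span", class_="css-1fdy0l5").text,
        "review": soup.find("a", class_="css-19v1rkv").text,
        "location": soup.find("address").get_text(strip=True),
        "working hours": soup.find("table").get_text(strip=True),
    }

# Placeholder URL list; each result dict can be appended to the data list.
urls = ["https://www.yelp.com/biz/memento-sf-san-francisco-3"]
# data = [scrape_business(url, ("USERNAME", "PASSWORD")) for url in urls]
```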

5. Scraping Yelp search result page

You can also extract data from a Yelp search results page using the Web Scraper API. As above, all you need to do is inspect the elements using the developer tools and gather the appropriate CSS selectors from the Yelp search page.

Inspecting elements

Open the Yelp search result page in a web browser and use the developer tools to inspect the elements. Notice that each search result is wrapped inside a div with a unique attribute data-testid="serp-ia-card".

Now, you can inspect each of the elements and find the CSS selectors for the name, review count, rating, neighborhood, and URL. For example, the name is wrapped in an <a> tag, which is enclosed in an <h3> tag.

Similarly, you can find the rest of the CSS selectors using the developer tools. For your convenience, all of them are given below.

  • name: h3 a

  • rating: span.css-gutk1c

  • review count: span.css-chan6m

  • neighborhood: div.css-1kiyre6 span.css-chan6m

  • url: h3 a (the href attribute)

Parsing search results

Now that you have all the necessary CSS selectors, you can start parsing the data from the HTML content. Use the Web Scraper API the same way you did for the Yelp business page.

Once the HTML content is extracted, use Beautiful Soup to parse it further and extract div elements. Then, you can use a for loop to extract content from each of the div elements. The full source code is given below:

from bs4 import BeautifulSoup
import pandas as pd
import requests


page = "https://www.yelp.com/search?find_desc=Restaurants&find_loc=San%20Francisco%2C%20CA"
payload = {
    "source": "universal",
    "render": "html",
    "url": page,
}

credentials = ("USERNAME", "PASSWORD")

response = requests.post(
    "https://realtime.oxylabs.io/v1/queries",
    auth=credentials,
    json=payload,
)
print(response.status_code)

data = []

for result in response.json()["results"]:
    soup = BeautifulSoup(result["content"], "html.parser")
    for div in soup.find_all("div", {"data-testid": "serp-ia-card"}):
        name = div.find("h3").find("a").get_text(strip=True)
        rating = div.find("span", class_="css-gutk1c").get_text(strip=True)
        review_count = div.find("span", class_="css-chan6m").get_text(strip=True).replace("(", "").replace(" reviews)", "")
        neighborhood = div.find("div", class_="css-1kiyre6").find("span", class_="css-chan6m").get_text(strip=True)
        url = div.find("h3").find("a")["href"]
        data.append({
            "name": name,
            "rating": rating,
            "review": review_count,
            "neighborhood": neighborhood,
            "url": url,
        })
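The chained replace() calls above assume the review count always arrives as "(123 reviews)". A slightly more tolerant sketch (the helper name is ours) pulls out the first number with a regular expression instead:

```python
import re

def parse_review_count(text):
    """Extract the first integer from strings like '(123 reviews)' or
    '1,024'; return 0 when no digits are found."""
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else 0

print(parse_review_count("(123 reviews)"))  # 123
print(parse_review_count("1,024"))          # 1024
print(parse_review_count("No reviews"))     # 0
```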

Exporting data to CSV

Using the pandas library, you can easily export the extracted data into a CSV file. First, convert the data list into a data frame. Then use the to_csv() method as below:

df = pd.DataFrame(data)
df.to_csv("yelp_data.csv", index=False)
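A single search page only carries a limited batch of results. Yelp paginates through a `start` offset query parameter (typically 10 results per page, but verify this in your own browser, since it's an observation rather than a documented contract). A small sketch for building the URLs of the first few pages, each of which can then go into its own payload:

```python
base = (
    "https://www.yelp.com/search"
    "?find_desc=Restaurants&find_loc=San%20Francisco%2C%20CA"
)

def search_pages(base_url, pages, per_page=10):
    """Build URLs for the first `pages` result pages using Yelp's
    `start` offset parameter (assumed 10 results per page)."""
    return [f"{base_url}&start={i * per_page}" for i in range(pages)]

for url in search_pages(base, 3):
    print(url)
```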

Conclusion

In this tutorial, you’ve learned how to use the Web Scraper API to bypass anti-bot protection and extract Yelp data effortlessly. You also learned how to export the data and store it in a CSV file. By using the Web Scraper API and the techniques described in this article, you can also scrape similarly complex websites without getting blocked.

Frequently Asked Questions

Can I download Yelp reviews?

Downloading Yelp reviews is possible. To do it quickly and efficiently, you can use our Yelp Scraper API which can deliver localized Yelp data in a matter of seconds.

About the author

Enrika Pavlovskytė

Copywriter

Enrika Pavlovskytė is a Copywriter at Oxylabs. With a background in digital heritage research, she became increasingly fascinated with innovative technologies and started transitioning into the tech world. On her days off, you might find her camping in the wilderness and, perhaps, trying to befriend a fox! Even so, she would never pass up a chance to binge-watch old horror movies on the couch.

All information on Oxylabs Blog is provided on an "as is" basis and for informational purposes only. We make no representation and disclaim all liability with respect to your use of any information contained on Oxylabs Blog or any third-party websites that may be linked therein. Before engaging in scraping activities of any kind you should consult your legal advisors and carefully read the particular website's terms of service or receive a scraping license.
