How to Extract Amazon Reviews with Python Scrapy?

Retail Gators July 27, 2021


Introduction

We search for many things online every day: to buy something, to compare one product with another, or to decide whether one product is better than another. We go straight to the reviews to see the stars or the positive feedback a product has received, right?

In this tutorial, we will see how to extract Amazon reviews with Python Scrapy. We will save the data in an Excel spreadsheet or a CSV file. These are the data fields we will extract:

Review’s Title
Ratings
Reviewer’s Name
Review’s Description
Review’s Content
Helpful Counts
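As a rough sketch of where these fields end up, the record below holds one review and is written out as CSV. The `Review` class, its attribute names, and the sample row are made up for illustration, and the description and content are kept as a single `content` field. The standard-library `csv` module is used here so the sketch runs on its own; Pandas could be swapped in for the export.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class Review:
    """One scraped review (hypothetical field names)."""
    title: str
    rating: float
    reviewer_name: str
    content: str
    helpful_count: int


def reviews_to_csv(reviews, handle):
    """Write a header row followed by one CSV row per review."""
    writer = csv.DictWriter(handle, fieldnames=[f.name for f in fields(Review)])
    writer.writeheader()
    for review in reviews:
        writer.writerow(asdict(review))


# Made-up sample row, not real Amazon data.
sample = [Review("Great phone", 4.0, "Jane Doe", "Battery lasts two days.", 12)]
buffer = io.StringIO()
reviews_to_csv(sample, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead of a string buffer, pass a file opened with open("reviews.csv", "w", newline="", encoding="utf-8") as the handle.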

Then we will do some basic analysis with Pandas on the dataset we have extracted. Some data cleaning will be needed, and in the end we will present price comparisons in a simple visual chart with Seaborn and Matplotlib.

Between these two platforms, we found Shopee harder to extract data from, for two reasons: (1) it has frustrating popup boxes that appear when you enter its pages, and (2) its website class elements are not well defined (a few elements have different classes).

For this reason, we will start by extracting Lazada first. We will work with Shopee in Part 2!

First, we import the required packages:

# Web Scraping
from selenium import webdriver
from selenium.common.exceptions import *
# Data manipulation
import pandas as pd
# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

It’s time to get started.

We chose Scrapy, a Python framework for large-scale data scraping. Along with it, a few other packages are needed to extract Amazon product reviews.

Requests: for sending a URL request
Pandas: for exporting CSV
Pymysql: for connecting to a MySQL server and storing data there
Math: for mathematical operations

You can install Scrapy with pip or conda, as shown below.

pip install scrapy

conda install -c conda-forge scrapy

Let’s Outline the Start URL for Scraping Seller’s Links

Let’s see what it looks like to extract reviews for a product. We have taken the URL https://www.amazon.com/dp/B07N9255CG, and the product page looks like this:

When we go to its review section, it looks like the image given below. The names shown in the reviews may differ.

However, if you carefully inspect the requests made in the background while a page loads, and move back and forth between the next and previous review pages, you will notice that a POST request is sent that returns the content of each page.

Here, we look at the payload and headers needed for a successful response. If you have inspected the pages properly, you will see what changes when you move between pages and how that is reflected in the requests.

NEXT PAGE --- PAGE 2
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_arp_d_paging_btm_next_2
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
origin: https://www.amazon.com
referer: https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG?ie=UTF8&reviewerType=all_reviews
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest

Payload:
reviewerType: all_reviews
pageNumber: 2
shouldAppend: undefined
reftag: cm_cr_arp_d_paging_btm_next_2
pageSize: 10
asin: B07N9255CG

PREVIOUS PAGE --- PAGE 1
https://www.amazon.com/hz/reviews-render/ajax/reviews/get/ref=cm_cr_getr_d_paging_btm_prev_1
Headers:
accept: text/html,*/*
accept-encoding: gzip, deflate, br
accept-language: en-US,en;q=0.9
content-type: application/x-www-form-urlencoded;charset=UTF-8
origin: https://www.amazon.com
referer: https://www.amazon.com/Moto-Alexa-Hands-Free-camera-included/product-reviews/B07N9255CG/ref=cm_cr_arp_d_paging_btm_next_2?ie=UTF8&reviewerType=all_reviews&pageNumber=2
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36
x-requested-with: XMLHttpRequest

Payload:
reviewerType: all_reviews
pageNumber: 2
shouldAppend: undefined
reftag: cm_cr_arp_d_paging_btm_next_2
pageSize: 10
asin: B07N9255CG

The Key Part: Script/CODE

There are two ways to build the script:

Make an entire Scrapy project
Make a group of files in a folder to keep the project small

In a past tutorial, we showed you an entire Scrapy project and how to create and modify it. This time, we have chosen the most compact way possible. Yes, only a group of files and the Amazon reviews will be there!

We are using Python and Scrapy to scrape all the Amazon reviews, so it is easy and convenient to take the XPath route.

The most important part of working with XPath is capturing the pattern. Copying an XPath from the Chrome inspect window and pasting it is easy, but it is an old-school method and does not work every time.

So, what should we do? We look at the XPaths for the same field, say “Review Title”, and see how they form a pattern that lets us shorten the XPath.

Here are two examples of related XPaths.

Here, you can see that the tags holding a “Review Title” share many similar attributes.

So, the resulting XPath to use for the Review Title is:

//a[contains(@class,"review-title-content")]/span/text()

Here are all the XPaths for the different fields that we will scrape.

Review’s Title: //a[contains(@class,"review-title-content")]/span/text()
Ratings: //a[contains(@title,"out of 5 stars")]/@title
Reviewer’s Name: //div[@id="cm_cr-review_list"]//span[@class="a-profile-name"]/text()
Review Content or Description: //span[contains(@class,"review-text-content")]/span/text()
Useful Count: //span[contains(@class,"cr-vote-text")]/text()
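To see how these expressions behave, here is a minimal sketch that runs them against a small hand-written HTML fragment. The fragment only imitates the structure the XPaths expect and is not real Amazon markup. The sketch uses lxml directly so that it runs without a live response; inside a Scrapy callback, the same expressions would typically be passed to response.xpath(...). The Useful Count expression is written with a leading // so that it can match anywhere in the page.

```python
from lxml import html

# Hand-written fragment that imitates the expected structure (assumption,
# not copied from Amazon).
SAMPLE = """
<div id="cm_cr-review_list">
  <div class="review">
    <span class="a-profile-name">Jane Doe</span>
    <a title="4.0 out of 5 stars" href="#"><span>4.0 out of 5 stars</span></a>
    <a class="a-size-base review-title-content a-text-bold" href="#">
      <span>Great phone for the price</span>
    </a>
    <span class="a-size-base review-text review-text-content">
      <span>  Battery lasts two days.  </span>
    </span>
    <span class="a-size-base cr-vote-text">12 people found this helpful</span>
  </div>
</div>
"""

XPATHS = {
    "title": '//a[contains(@class,"review-title-content")]/span/text()',
    "rating": '//a[contains(@title,"out of 5 stars")]/@title',
    "reviewer": '//div[@id="cm_cr-review_list"]//span[@class="a-profile-name"]/text()',
    "content": '//span[contains(@class,"review-text-content")]/span/text()',
    "helpful": '//span[contains(@class,"cr-vote-text")]/text()',
}


def extract_fields(page_source):
    """Return a dict mapping each field name to the list of raw matches."""
    tree = html.fromstring(page_source)
    return {name: tree.xpath(expr) for name, expr in XPATHS.items()}


fields = extract_fields(SAMPLE)
for name, values in fields.items():
    print(name, values)
```

Each value is a list of raw strings, one per match, so a page with ten reviews would give ten entries per field. The raw strings still carry the surrounding white space from the markup.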

Some stripping and joining of the results from a few of these XPaths is needed to get clean data. In addition, we need to remove extra white space.
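The cleaning step could look roughly like the helpers below. The function names are hypothetical, and the input formats ("4.0 out of 5 stars", "12 people found this helpful", "One person found this helpful") are assumptions about what the XPaths return, so adjust the patterns to whatever you actually see in the responses.

```python
import re


def clean_text(parts):
    """Strip each fragment, drop empty ones, and join the rest with a space."""
    return " ".join(p.strip() for p in parts if p and p.strip())


def parse_rating(title):
    """Turn '4.0 out of 5 stars' into 4.0; return None if no number is found."""
    match = re.search(r"(\d+(?:\.\d+)?)\s+out of 5", title or "")
    return float(match.group(1)) if match else None


def parse_helpful(text):
    """Turn '12 people found this helpful' into 12 and 'One person ...' into 1."""
    text = (text or "").strip()
    if text.lower().startswith("one "):
        return 1
    match = re.search(r"\d[\d,]*", text)
    return int(match.group().replace(",", "")) if match else 0


print(clean_text(["  Battery lasts ", "", "\n two days. "]))
print(parse_rating("4.0 out of 5 stars"))
print(parse_helpful("1,204 people found this helpful"))
```

Joining the fragments matters because a long review body can come back as several text nodes rather than one string.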

Sounds good! Now we know how to move across pages and how to scrape data from them, so it is time to collect the reviews!
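As a minimal sketch of the paging step, the helper below rebuilds the URL, headers, and form payload of the POST request captured earlier for any page number. The function name is hypothetical, and the assumption that the reftag simply ends in the page number is inferred from the single captured request, so verify it against your own browser inspection.

```python
from urllib.parse import urlencode

ENDPOINT = "https://www.amazon.com/hz/reviews-render/ajax/reviews/get/"


def build_review_request(asin, page, page_size=10):
    """Return (url, headers, payload) for one page of reviews."""
    # Assumption: the reftag follows the pattern seen for page 2.
    reftag = f"cm_cr_arp_d_paging_btm_next_{page}"
    url = f"{ENDPOINT}ref={reftag}"
    headers = {
        "accept": "text/html,*/*",
        "accept-language": "en-US,en;q=0.9",
        "content-type": "application/x-www-form-urlencoded;charset=UTF-8",
        "origin": "https://www.amazon.com",
        "x-requested-with": "XMLHttpRequest",
    }
    # Form values are kept as strings, as they appear in the captured payload.
    payload = {
        "reviewerType": "all_reviews",
        "pageNumber": str(page),
        "shouldAppend": "undefined",
        "reftag": reftag,
        "pageSize": str(page_size),
        "asin": asin,
    }
    return url, headers, payload


url, headers, payload = build_review_request("B07N9255CG", 2)
print(url)
print(urlencode(payload))
```

In a Scrapy spider, these three values could be handed to scrapy.FormRequest(url, formdata=payload, headers=headers, callback=...), which sends the form as a POST request. Increasing the page number until a response contains no reviews is one simple way to walk through all pages.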

Here is the entire code for scraping all the reviews for a single product!

source code: https://www.retailgators.com/how-to-extract-amazon-reviews-with-python-scrapy.php
