How To Analyse Amazon Reviews For Climbing Shoe Analysis Using Python?
Software Engineering

How To Analyse Amazon Reviews For Climbing Shoe Analysis Using Python?

iWeb Scraping
iWeb Scraping
7 min read

Those who are familiar with the sport recognise the significance of wearing a pair of rock-climbing shoes that fit well. Shoes that don't fit well can be disastrous in a variety of ways. Because it's a fringe sport, finding a pair that fits properly isn't as simple as stepping into a store and trying them on. Climbing shoe stores may be far and few between depending on one's place of residence. As a result, many people opt to buy it online.

On the other hand, buying a pair, is no easy task, and scrolling through hundreds of reviews is bound to irritate even the most methodical among us. As a result, in this blog, we have scraped various shoe products available on Amazon for detailed study.

Web Scraping Amazon Reviews Dataset

Here, we will use Selenium library in Python and divide the web scraping Services in various parts:

Part 1: Fetching Product URLs

The first-step will be to initialize the webdriver. For this we will use Google webdriver which can be downloaded using this link. The below code is used for starting chrome web browser.

Then we use get() method to get access to Amazon website. For avoiding simulation of entering search queries and keystrokes, simply attach search query to the base URL. The webpage coming from the user search will have the following format: https://www.amazon.com/ < query will be placed here>

Executing the below script will open the Amazon search result page for search query ‘la sportive climbing shoe’

The next stage would be to scrape all of the URLs on the page and save them to a file before moving on to the next page of search results. This process continues until you reach the final results page. For this approach, I write a few utility functions.

A string is written to a file using this function. The “a” option determines whether to create a new file (if one has not yet been created) or to publish to an existing file.

This method scrapes the entire current page for all URLs. If the words "climbing" and "shoe" appear in the URL, it writes the URLs to a file links.txt (using the write to file utility function). To avoid the Python NoneType issue that occurs when Selenium picks up NoneType objects, an additional condition is introduced to evaluate only strings (type(link) == str).

To put it all together, we begin by calculating the number of results pages. The class “a-normal” is used to represent integers in the DOM Tree. Here, you will get all numbers (in this case 1 to 5) using the find elements by class name method, then set the last number to the variable last page (.text extracts the actual text value which is typecast to an int variable).

We scrape the URLs by cycling over the results pages after putting the previous two lines into a function. The base URL was combined with both the search query and page number in Amazon's URL structure for each incrementing page number: https://www.amazon.com/s?k=query&page=page number. If the results just had one webpage, this exception handler was created (pages would evaluate to an empty list). Other major climbing shoe brands that went through this process were Ocun, Five Ten, Tenaya, Black Diamond, Scarpa, Mad Rock, Evolv, and Climb X.

The initial scraping pass yielded 2655 URLs. Filtering away undesirable URLs, however, required more cleaning. Examine the following list of nefarious URLs:

The first URL is a link to a similar product that is essentially an advertisement (these would refer to the products that have the sponsored tag on the results page). The second, third, and fourth URLs all lead to the search result, not a specific product page. The final URL is an advertisement URL as well.

Let's see how this compares to URLs that lead to individual product pages.

The first thing to notice is that clean URLs are free of slredirect and picassoRedirect.html. When we split down a system of clean URLs, we can see that the base URL is accompanied by the product name as the first two sections of the URL. In contrary, “dirty” URLs include the query signifier, s?k=, immediately after the basic URL. Another point to note is that product URLs begin with https://www.amazon.com/, as opposed to the dirty URLs' last example, which did not use that string as its base URL. To filter out these bad URLs, some simple data cleaning using regex was developed.

Furthermore, there seem to be a lot of URLs pointing to almost the same product page (even though their URLs were not 100 percent identical). Consider the following scenario:

The URL structure of Amazon is deconstructed in this blog post. In summary, only the “/dp/” and “ASIN” (Amazon Standard Identification Number) sections of URLs were significant. As a result, from the “ref=” section onwards, I erased everything. The final stage was to eliminate any URLs that directed to brands that were not part of this project. The data cleaning process yielded 425 clean, unique, and relevant URLs.

Part 2: Scraping Product Data

The flow of the scraping is as follows:

Go to the product page and scrape product title, product price.

Data Cleaning and Preprocessing

The initial step was to create two brand and model fields. The brand may easily be deduced from the product title. The extraction of the model, on the other hand, proved to be a difficult undertaking. First, we will sample 15 papers and analyze what a product name is made up of to identify the general components of a product name. To do so, we will use the MongoDB Query Language (MQL).

For text cleaning and preprocessing, Python's NLTK module is used. Customization of data cleaning activities to the algorithm is required for conducting sentiment analysis on Amazon reviews using VADER. In a nutshell, VADER assesses public sentiment by:

PunctuationCapitalizationDegree ModifiersConjunctionsWords That Flip The Observations

Below is an example of a review that has had its stop words removed. It may be seen qualitatively that the review's meaning has been preserved.

Great aggressive shoe. I have a wider foot and these fit me great. The laces really allow me to tune the fit. I downsized a half size from my gym shoe size. Initially pretty uncomfortable, but a shower and a few climbing sessions and they fit just right.Great aggressive shoe . I wider foot fit great. The laces really allow tune fit. I downsized half size gym shoe size. Initially pretty uncomfortable, shower climbing sessions fit right.

We also wanted each page to represent a single model for a single brand. The fields name and model could be used as a composite key in a relational database. The code for deleting duplicate documents can be available at the above-mentioned GitHub repository.

We have used VADER to conduct sentiment analysis on the reviews in order to turn them into something quantitative. A short lesson on how to use it may be found here. The following was my code implementation:

Discussion (0 comments)

0 comments

No comments yet. Be the first!