Python Beautiful Soup code
A screenshot of a web-scraper built with Python & BeautifulSoup4

In this article, I’ll teach you how to build a web scraper using Python & BeautifulSoup. This idea was born of my efforts to scrape data from competitor sites for a new web-application that I’m building.

A few months ago I started building a data-driven web application that collates and ranks products in an intuitive manner, with search engine optimization built in. However, competitors already exist that have tens of thousands of products.

But I didn’t let this deter me. First off, there is competition over a few things:

  • Competition over traffic
  • Competition over data
  • And, competition over the search results

Most of these problems are solved by harvesting data from my competitors. That way, my application will have a similar (if not greater) number of products. More products = more pages, more pages = more potential keywords to rank for on search engines.

But, I had to actually get the data.

I could go through and add products one-by-one, but that’d be a really tedious task. And after all, my competitors are not advertising their products in a nice format for me to reupload…

Though, there’s no reason why I can’t make my own.

So, I turned to Python and a library called BeautifulSoup. Python is a popular programming language that is not too difficult to pick up, and BeautifulSoup is a Python library used for web scraping.

What BeautifulSoup allowed me to do was deconstruct web pages into elements. Most websites are dynamic nowadays: each page is computer-generated, and people don’t actually write every single page. Instead, they use templates, and a programming language fills in the details on the template. This means that elements are consistently in the same place, with the same identifiers.

See for example, here, on IMDB:

image 14

The structure of the page is the same whether we look at the Animated TV Episodes (seen above) or the Sci-Fi video games (seen below).

image 15

This means we can harvest the data from these pages using the same code. The table is the same on both pages: it consists of multiple rows, each with a title, thumbnail image, and star rating. We just need to know how to access them.
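To make the “same code, different pages” idea concrete, here’s a minimal sketch. It assumes BeautifulSoup is already installed (covered later in this article), and the two HTML snippets are hand-written stand-ins for template output. The tag and class names (`row`, `title`) and the function name `extract_titles` are made up for illustration; they are not IMDb’s real markup.

```python
from bs4 import BeautifulSoup

# Two hand-written pages that mimic output from one shared template.
# The tag and class names here are illustrative, not IMDb's real markup.
TV_PAGE = """
<div class="row"><h3 class="title">Episode One</h3></div>
<div class="row"><h3 class="title">Episode Two</h3></div>
"""
GAMES_PAGE = """
<div class="row"><h3 class="title">Game One</h3></div>
"""


def extract_titles(html):
    """Return the text of every h3.title found inside a div.row."""
    soup = BeautifulSoup(html, "html.parser")
    return [h3.get_text(strip=True) for h3 in soup.select("div.row h3.title")]


# The same function handles both pages because the structure is shared.
print(extract_titles(TV_PAGE))
print(extract_titles(GAMES_PAGE))
```

The content differs between the two pages, but because the structure is identical, one function covers both.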

Identifying scrapable elements

Lucky for us, most browsers are pretty forward-thinking nowadays and have built-in development toolkits. These toolkits allow you to Inspect Elements on-page.

Inspect elements

All you have to do is right-click and choose Inspect.

image 16

You may have to toggle to the inspect mode, depending on what browser you’re using.

image 17

But once you’re in inspect mode, you can hover over all the elements on the page. Sticking to our IMDB example, when I hover over a row, I see this.

image 18

Note the little tooltip that pops up and says div.lister-item.mode-advanced. This tooltip displays the element’s tag name followed by its “class” values. Every other row on this page is built from the same parent component.

image 19

See how it’s the same for the second row. Take note of this parent class for use when we scrape the page later. We’ll want to loop through all the div.lister-item.mode-advanced nodes on the page when we come to scrape the content.

Within each of the parent nodes, we can also identify the title of each product. Again, we’ll be using the “class” value to identify the title relative to the parent div list-item.... We can also see the data in HTML format below; if we hover over the markup, it is highlighted on the page too, which is super helpful.

image 20

The same principle can be applied to every element of the row, whether we want to pull the directors, the movie’s genres, or the star rating. Anything in that row that is repeated on other rows, we can scrape.

Coding a web scraper

Now that we’ve got a decent understanding of how the elements on the page function, and we’ve figured out some generic class names, it’s time to start coding.

Requirements

  • Visual Studio Code (not essential)
  • Python
  • PIP
  • BeautifulSoup

Visual Studio Code

Personally, I most enjoy using Visual Studio Code as my coding environment. It’s a free code editor from Microsoft, incredibly popular and extremely versatile. Syntax highlighting for Python, and almost any other extension you might want, is available on its extensions marketplace.

Python

If you haven’t got Python installed, you’ll need it. The following code is written in Python 3.

Pip

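First and foremost, you need Python installed. Once Python is installed, you need the pip package manager (download and run this pip installer). Recent Python 3 installers usually include pip already, so check whether the pip command works before installing it separately. Pip is a tool that installs Python libraries via the command line.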

BeautifulSoup

To install BeautifulSoup with pip, we just run pip install beautifulsoup4. Numerous other packages are available to install via pip; this tutorial also uses requests to download pages.

Run the following code in your command prompt or terminal.


pip install beautifulsoup4
pip install requests
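If you want to confirm that the install worked before going further, here’s a small sketch that checks whether Python can find both libraries. The helper name `missing_packages` is my own invention. Note that the beautifulsoup4 package is imported under the name bs4.

```python
from importlib.util import find_spec


def missing_packages(import_names):
    """Return the import names that Python cannot locate."""
    return [name for name in import_names if find_spec(name) is None]


# beautifulsoup4 is imported as "bs4"; requests is imported as "requests".
missing = missing_packages(["bs4", "requests"])
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("Both libraries are importable.")
```

If either name is reported as not installed, re-run the matching pip install command above.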

Scraping a simple tag with BeautifulSoup

Before we dive in super deep, let’s start with a minimal scraper: one that literally just grabs the site’s <title> tag. This contains the text shown on the browser’s tab.

# a minimal python beautiful soup webscraper
from bs4 import BeautifulSoup
from requests import get
url = "https://www.imdb.com/search/title/?title_type=video_game"
soup = BeautifulSoup(get(url).text, 'html.parser')
print(soup.title.text)

The output from running this script is as follows.

>>> print(soup.title.text)
Video Game
(Sorted by Popularity Ascending) - IMDb

Great, so we’ve managed to scrape some data from the page. Now we can expand on this and scrape multiple elements.
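One caveat about the minimal script: it assumes the request succeeds and that the page has a <title>. Here’s a slightly more defensive sketch of the same idea. The function names `fetch_html` and `page_title`, the timeout value, and the User-Agent string are my own choices for illustration, and whether a given site needs a browser-like User-Agent is an assumption. The demonstration at the bottom runs offline against a hand-written document.

```python
from bs4 import BeautifulSoup
import requests

# Hypothetical header value. Some sites reject requests that send no
# browser-like User-Agent; whether a given site does is an assumption.
HEADERS = {"User-Agent": "Mozilla/5.0 (compatible; tutorial-scraper)"}


def fetch_html(url, timeout=10):
    """Download a page, raising an exception on HTTP errors or timeouts."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # 4xx/5xx responses raise requests.HTTPError
    return response.text


def page_title(html):
    """Return the <title> text on one line, or None if there is no <title>."""
    soup = BeautifulSoup(html, "html.parser")
    if soup.title is None:
        return None
    # split() + join() collapses line breaks and repeated spaces.
    return " ".join(soup.title.get_text().split())


# Offline demonstration with a hand-written document.
sample = (
    "<html><head><title>Video Game\n"
    "(Sorted by Popularity Ascending) - IMDb</title></head></html>"
)
print(page_title(sample))
```

Against a live page, the two pieces combine as `page_title(fetch_html(url))`, using the same url as in the script above.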

Build a web scraper that actually pulls product data

Using the previous code as the start template, we can scrape text using the classes we found earlier:

  • div.lister-item.mode-advanced
  • h3.lister-item-header

Note that the formatting is slightly different in the script, however. The tag name (the part before the first “.”) is passed first, then the classes are passed as one string with a “[space]” separator instead of a “.”.
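To see how the two notations relate, here’s a small sketch on a hand-written fragment (only the class names come from the IMDB page; the rest is made up). A multi-class string passed to class_ matches the exact value of the class attribute, so the order of the classes matters, whereas select() accepts the dotted notation from the browser tooltip and ignores the order.

```python
from bs4 import BeautifulSoup

# Hand-written fragment; only the class names come from the article.
# The second div lists the same two classes in the opposite order.
html = """
<div class="lister-item mode-advanced"><h3 class="lister-item-header">First</h3></div>
<div class="mode-advanced lister-item"><h3 class="lister-item-header">Second</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# A multi-class string matches the class attribute's exact value,
# so only the first div is found.
exact = soup.find_all("div", class_="lister-item mode-advanced")

# A single class name matches any element that carries that class.
single = soup.find_all("div", class_="lister-item")

# select() accepts the dotted CSS notation and ignores class order.
css = soup.select("div.lister-item.mode-advanced")

print(len(exact), len(single), len(css))
```

This prints 1 2 2. If a site ever emits its classes in a different order, the single-class form or select() is the safer choice.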

# a python beautiful soup webscraper that prints the title of every row
from bs4 import BeautifulSoup
from requests import get
url = "https://www.imdb.com/search/title/?title_type=video_game"
soup = BeautifulSoup(get(url).text, 'html.parser')


# get all the rows
rows = soup.find_all('div', class_="lister-item mode-advanced")

# loop through all the rows
for row in rows:
	# find the .lister-item-header in that row
	listerItemHeader = row.find('h3', class_="lister-item-header")
	# print the text of the header (and remove any line breaks)
	print(listerItemHeader.text.replace("\n"," "))

And there we have it, we’ve scraped our first element. After running we get the following.

 1. Cyberpunk 2077 (2020 Video Game)
 2. Assassin's Creed Valhalla (2020 Video Game)
 3. The Last of Us: Part II (2020 Video Game)
 4. Star Wars Jedi: Fallen Order (2019 Video Game)
 5. Red Dead Redemption II (2018 Video Game)
 6. Death Stranding (2019 Video Game)
 7. Grand Theft Auto V (2013 Video Game)
 8. Call of Duty: Black Ops Cold War (2020 Video Game)
 9. Spider-Man: Miles Morales (2020 Video Game)
...
 49. Assassin's Creed: Origins (2017 Video Game)
 50. Far Cry 5 (2018 Video Game)

Now that we’ve confidently pulled the title, we can use the same principle to pull other elements. If we want the image, we should reference the img tag used in the row. If we want the description, we can reference the description block by its tag.
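Here’s a rough sketch of what that could look like. It runs against a hand-written row rather than the live page: the lister-item and lister-item-header classes are the ones we found earlier, but the img attributes, the “description” class, and the `parse_row` helper are hypothetical, so check the real markup with Inspect before relying on them.

```python
from bs4 import BeautifulSoup

# Hand-written row. "lister-item" and "lister-item-header" come from the
# article; the <img> and the "description" class are hypothetical.
html = """
<div class="lister-item mode-advanced">
  <img src="https://example.com/poster.jpg" alt="Example Game">
  <h3 class="lister-item-header">1. Example Game (2020 Video Game)</h3>
  <p class="description">A short plot summary.</p>
</div>
"""


def parse_row(row):
    """Collect title, image URL and description; missing parts become None."""
    header = row.find("h3", class_="lister-item-header")
    image = row.find("img")
    description = row.find("p", class_="description")
    return {
        "title": header.get_text(" ", strip=True) if header is not None else None,
        "image": image.get("src") if image is not None else None,
        "description": (
            description.get_text(strip=True) if description is not None else None
        ),
    }


soup = BeautifulSoup(html, "html.parser")
products = [parse_row(row) for row in soup.find_all("div", class_="lister-item")]
print(products)
```

Returning None for a missing element means one odd row won’t crash the whole loop.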

Following this scraping we can store this data wherever needed.
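One simple place to store the results is a CSV file. Here’s a minimal sketch using Python’s built-in csv module, with the first two titles from the output above as sample data. The column names and the `write_products` helper are my own choices, and the demo writes to an in-memory buffer so it doesn’t touch the disk.

```python
import csv
import io

# The first two rows of the output shown above.
products = [
    {"rank": 1, "title": "Cyberpunk 2077 (2020 Video Game)"},
    {"rank": 2, "title": "Assassin's Creed Valhalla (2020 Video Game)"},
]


def write_products(rows, fileobj):
    """Write the rows as CSV, with a header line, to an open text file."""
    writer = csv.DictWriter(fileobj, fieldnames=["rank", "title"])
    writer.writeheader()
    writer.writerows(rows)


# An in-memory buffer keeps the demo free of disk writes. For a real file,
# pass open("products.csv", "w", newline="", encoding="utf-8") instead.
buffer = io.StringIO()
write_products(products, buffer)
csv_text = buffer.getvalue()
print(csv_text)
```

A database or a JSON file would work just as well; CSV is simply the quickest to open in a spreadsheet.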

Conclusion

Scraping data from competitor websites is easy, especially if their site uses consistent templates. All you need is Python, BeautifulSoup, and the selectors of the elements that you’re looking to scrape.

Feel free to ask any questions, if you need any help.

Until next time,

Josh
