In this article, I’ll teach you how to build a web scraper using Python & BeautifulSoup. This idea was born of my efforts to scrape data from competitor sites for a new web-application that I’m building.
A few months ago I started building a web application: a data-driven application that collates and ranks products in an intuitive manner, with search engine optimization built in. However, there are existing competitors that already have tens of thousands of products.
But I didn’t let this deter me. First off, the competition plays out over a few things:
- Competition over traffic
- Competition over data
- And, competition over the search results
Most of these problems are solved by harvesting data from my competitors. That way, my application will have a similar (if not greater) number of products. More products = more pages, more pages = more potential keywords to rank for on search engines.
But, I had to actually get the data.
I could go through and add products one-by-one, but that’d be a really tedious task. And after all, my competitors are not advertising their products in a nice format for me to reupload…
Though, there’s no reason why I can’t make my own.
So, I turned to Python and a library called BeautifulSoup. Python is a popular programming language, not too difficult to pick up, and BeautifulSoup is a Python library used for web scraping.
What BeautifulSoup allowed me to do was deconstruct web pages into elements. Most websites are dynamic nowadays: each page is computer-generated, rather than written by hand. Sites use templates, and a programming language fills in the details on the template. This means that elements sit consistently in the same place, with the same unique identifiers.
See for example, here, on IMDB:
The structure of the page is the same whether we look at Animated TV Episodes (seen above) or at Sci-Fi video games (seen below).
This means we can harvest the data from these pages using the same code. The table is the same on both pages, each row consisting of a title, a thumbnail image, and a star rating. We just need to know how to access them.
Identifying scrapable elements
Lucky for us, most browsers are pretty forward-thinking nowadays and have built-in developer tools. These tools allow you to inspect elements on the page.
All you have to do is right-click, Inspect.
You may have to toggle to the inspect mode, depending on what browser you’re using.
But once you’re in inspect mode, you can hover over any element on the page. Sticking with our IMDB example, when I hover over a row, I see this.
Note the little tooltip that pops up that says
div.lister-item.mode-advanced. This tooltip displays the “class” value of the element. Every other row on this page is built from the same parent component.
See how it’s the same for the second row. Take note of this parent class for use when we scrape the page later. We’ll want to loop through all the
div.lister-item.mode-advanced nodes on the page when we come to scrape the content.
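As a quick aside, the tooltip’s value can be used almost verbatim as a CSS selector, since BeautifulSoup supports those via select(). Here’s a minimal sketch; the two-row HTML snippet is a made-up stand-in for IMDB’s markup, just to show the selector matching:

```python
from bs4 import BeautifulSoup

# a made-up stand-in for two IMDB rows sharing the same parent class
html = """
<div class="lister-item mode-advanced"><h3>Row one</h3></div>
<div class="lister-item mode-advanced"><h3>Row two</h3></div>
"""
soup = BeautifulSoup(html, "html.parser")

# the tooltip value "div.lister-item.mode-advanced" works as a CSS selector
rows = soup.select("div.lister-item.mode-advanced")
print(len(rows))  # 2
```

This is handy for sanity-checking that a class you spotted in the inspector actually matches the nodes you expect.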
Within each of the parent nodes, we can also identify the title of each product. Again, we’ll be using the “class” value to identify the title relative to the parent
div.lister-item.... We can also see the data in HTML format below; hovering over the markup highlights it on-page too, which is super helpful.
The same principle applies to all elements of the row. Whether we want to pull the directors, the movie’s genres, or the star rating: anything in that row that is repeated on other rows, we can scrape.
Coding a web scraper
Now that we’ve got a decent understanding of how the elements on the page function, and we’ve figured out some generic class names, it’s time to start coding.
- Visual Studio Code (not essential)
Visual Studio Code
Personally, I most enjoy using Visual Studio Code as my coding environment. It’s a free IDE from Microsoft, incredibly popular and extremely versatile. It comes with built-in syntax highlighting for Python, and any extension you might want is probably available on its extensions marketplace.
If you haven’t got Python installed, you’ll need it; the following will be coded in Python 3. Once Python is installed, you need the pip package manager (download and run this pip installer). Pip is a tool that installs Python libraries via the command line.
So, say we wanted to install BeautifulSoup with pip: we can just run pip install beautifulsoup4. There are numerous packages available to install via pip.
Run the following code in your command prompt or terminal.
```
pip install beautifulsoup4
pip install requests
```
Scraping a simple tag with BeautifulSoup
Before we dive in super deep, let’s start with a minimal scraper. One that literally just grabs the site’s
<title> tag. This contains the text shown on the browser’s tab.
```python
# a minimal python beautiful soup webscraper
from bs4 import BeautifulSoup
from requests import get

url = "https://www.imdb.com/search/title/?title_type=video_game"
soup = BeautifulSoup(get(url).text, 'html.parser')
print(soup.title.text)
```
The output from running this script is as follows.
```
>>> print(soup.title.text)
Video Game (Sorted by Popularity Ascending) - IMDb
```
Great, so we’ve managed to scrape some data from the page. Now we can expand on this and scrape multiple elements.
Build a web scraper that actually pulls product data
Using the previous code as the start template, we can scrape text using the classes we found earlier:
Note that the formatting is slightly different in the script, however. The tag of the selector is put first, then the classes are passed with a space separator instead of a “.”.
```python
# a minimal python beautiful soup webscraper
from bs4 import BeautifulSoup
from requests import get

url = "https://www.imdb.com/search/title/?title_type=video_game"
soup = BeautifulSoup(get(url).text, 'html.parser')

# get all the rows
rows = soup.find_all('div', class_="lister-item mode-advanced")

# loop through all the rows
for row in rows:
    # get the .lister-item-header in that row
    listerItemHeader = row.find('h3', class_="lister-item-header")
    # print the text of the header (and remove any line breaks)
    print(listerItemHeader.text.replace("\n", " "))
```
And there we have it, we’ve scraped our first element. After running the script, we get the following output.
```
1. Cyberpunk 2077 (2020 Video Game)
2. Assassin's Creed Valhalla (2020 Video Game)
3. The Last of Us: Part II (2020 Video Game)
4. Star Wars Jedi: Fallen Order (2019 Video Game)
5. Red Dead Redemption II (2018 Video Game)
6. Death Stranding (2019 Video Game)
7. Grand Theft Auto V (2013 Video Game)
8. Call of Duty: Black Ops Cold War (2020 Video Game)
9. Spider-Man: Miles Morales (2020 Video Game)
...
49. Assassin's Creed: Origins (2017 Video Game)
50. Far Cry 5 (2018 Video Game)
```
Now that we’ve confidently pulled the title, we can use the same principle to pull other elements. If we want the image, we should reference the
img tag used in the row. If we want the description, we can reference the description block by tag.
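For instance, extending the loop to grab the thumbnail and description might look like the sketch below. It runs against a hand-written stand-in row, since IMDB’s exact markup may change; the class names here are assumptions, so swap in whatever your browser’s inspector shows you:

```python
from bs4 import BeautifulSoup

# a hand-written stand-in for one IMDB row; the real class names may
# differ, so verify them with your browser's inspector before scraping
html = """
<div class="lister-item mode-advanced">
  <h3 class="lister-item-header">Cyberpunk 2077</h3>
  <img src="https://example.com/thumb.jpg" alt="Cyberpunk 2077">
  <p class="text-muted">An open-world action RPG.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

for row in soup.find_all("div", class_="lister-item mode-advanced"):
    title = row.find("h3", class_="lister-item-header").text.strip()
    image = row.find("img")["src"]  # the thumbnail URL from the img tag
    description = row.find("p", class_="text-muted").text.strip()
    print(title, image, description, sep=" | ")
```

The pattern is always the same: find the parent row, then pull each child element by tag and class relative to it.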
Once the scraping is done, we can store this data wherever it’s needed.
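As one option (not the only one), the scraped rows could be dumped straight to a CSV file with Python’s built-in csv module. The field names and product rows below are purely illustrative stand-ins for whatever the scraping loop collects:

```python
import csv

# illustrative scraped rows; in practice these come from the scraping loop
products = [
    {"title": "Cyberpunk 2077 (2020 Video Game)", "rating": "6.9"},
    {"title": "Red Dead Redemption II (2018 Video Game)", "rating": "9.2"},
]

# write the rows to products.csv with a header line
with open("products.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["title", "rating"])
    writer.writeheader()
    writer.writerows(products)
```

A CSV is easy to open in a spreadsheet or import into a database later, which makes it a sensible first stop for scraped data.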
Scraping data from competitor websites is easy, especially if their sites use consistent templates. All you need is Python, BeautifulSoup, and the element selectors for the data you’re looking to scrape.
Feel free to ask any questions, if you need any help.
Until next time,