Beautifulsoup Python Guide – Install, Find by Class and more

Posted by Marta on December 18, 2020

In this tutorial I will show you how to install the beautifulsoup library in Python and use some of its most popular methods, such as find by class, find_all and find by id. This library is really useful when you are extracting information out of a website, also known as web scraping.

Beautifulsoup is a python module that helps you parse html and extract information out of an html document. It provides functionality to navigate, search and extract information out of an HTML or XML file. See the official documentation for more details.

Install Beautifulsoup

If you would like to use the beautifulsoup module in a python program, you should start by installing the library. Beautifulsoup is a third party library, meaning it’s not a built-in library installed along with python. The simplest way to install it is using pip. Here is the command you need to run from your terminal:

pip install beautifulsoup4

What if you want to work in a virtual environment? You will need to create and activate the virtual environment first and then install the library. To do so, execute the following commands from your terminal:

python3 -m venv myvirtualenv 
source myvirtualenv/bin/activate
pip install beautifulsoup4

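The first line will create a folder called myvirtualenv that will contain a python virtual environment. The second line activates it, and the third installs the library inside it. Note that you can replace myvirtualenv with any folder name of your choice, as long as you use the same name in the second line.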
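If you want to confirm the installation worked, a quick check like the one below can help. This is only a minimal sketch: the tiny html string and the variable names are my own, not part of the library.

```python
import bs4
from bs4 import BeautifulSoup

# Parse a tiny inline document to confirm the library imports and works.
# The html string here is a made-up example.
soup = BeautifulSoup("<p>Hello, soup!</p>", "html.parser")

installed_version = bs4.__version__  # version string of the installed package
greeting = soup.p.text               # text inside the first <p> tag

print(installed_version)
print(greeting)
```

If the import fails with a ModuleNotFoundError, the library was most likely installed in a different python environment from the one running the script.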

Sample HTML document

Here is the HTML document that I will be using in this tutorial:

<html>
<head>
    <title>How to scrape with beautifulsoup</title>
    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
</head>
<!-- The information between the BODY and /BODY tags is displayed.-->
<body>
    <div class="container">
        <div class="header">
            <h1 id="heading-1">Main heading</h1>
        </div>
    
        <p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
        <ul>
            <div class="alert alert-primary"><li> 1.First item</li></div>
            <div class="alert alert-primary"><li> 2.Second item</li></div>
            <div class="alert alert-primary"><li> 3.Third item</li></div>
        </ul>
        <p>Improve your image by including an image. </p>
        <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
        <p>Find out all about python web scraping with <a href="https://www.google.com/">Google</a>.</p><hr>
        <p>Some more stuff here: <a href="page2.html">another page.</a></p>
    </div>
</body>
</html>

This document has a heading, some paragraphs and a list. If you paste this html into an html file on your computer and open it with the browser, you should see the main heading followed by the paragraphs, the three list items and the links. For this example, I will name the file sample.html.

Beautifulsoup Methods: Find by Class and By Id

Let’s see how we can scrape information out of the previous document. First you need to navigate to the specific element that holds the information to extract. Next you need to read the information, text or attribute, from the element. Usually you will reach an element of the html document by searching by class or id.
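The two steps described above, navigating to an element and then reading its text or an attribute, could look roughly like this. To keep the sketch runnable without sample.html, I am inlining a shortened copy of the sample document as a string; that shortcut and the variable names are my own assumptions.

```python
from bs4 import BeautifulSoup

# Shortened inline copy of the sample document (an assumption, so the
# sketch runs without needing sample.html on disk).
html = """
<div class="container">
    <h1 id="heading-1">Main heading</h1>
    <p>Find out all about python web scraping with <a href="https://www.google.com/">Google</a>.</p>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Step 1: navigate to the element that holds the information.
link = soup.find("a")

# Step 2: read the information, either the text or an attribute.
link_text = link.text     # text between <a> and </a>
link_href = link["href"]  # attributes are read like dictionary keys

print(link_text)  # Google
print(link_href)  # https://www.google.com/
```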

Beautifulsoup Find by Class: Select

Beautifulsoup provides features to allow you to search an html element using its class. You can search by class using the .select() method, passing as argument the class name. This method will run a CSS Selector against the parsed document and return all matching elements. See the example below:

from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
li_elements = html_parser.select('.alert-primary')

print(len(li_elements))

Output:

3

The code above first opens the html document from a file (line 3). Line 4 creates a beautifulsoup structure that is easy to navigate and search. Finally, line 5 searches for the elements that have the class alert-primary. The code prints 3, since there are three elements with class alert-primary.

Keep in mind the .select() method will always return a list. Therefore, if you want to work with any of the resulting items, you will need to access it by index first.
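Here is a small sketch of what accessing the results by index might look like. The list markup is copied from the sample document and inlined as a string so the snippet runs on its own; the variable names and the .no-such-class selector are made up for illustration.

```python
from bs4 import BeautifulSoup

# The list from the sample document, inlined so the sketch is self-contained.
html = """
<ul>
    <div class="alert alert-primary"><li> 1.First item</li></div>
    <div class="alert alert-primary"><li> 2.Second item</li></div>
    <div class="alert alert-primary"><li> 3.Third item</li></div>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

matches = soup.select(".alert-primary")

# select() returns a list, so pick one item by index...
first_text = matches[0].text.strip()

# ...or loop over all of them.
all_texts = [match.text.strip() for match in matches]

# When nothing matches you get an empty list, so check before indexing.
missing = soup.select(".no-such-class")

# select_one() returns only the first match (or None) instead of a list.
first_match = soup.select_one(".alert-primary")

print(first_text)   # 1.First item
print(all_texts)    # ['1.First item', '2.Second item', '3.Third item']
print(missing)      # []
```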

Find by Id

We have seen how to search by class. Another handy and commonly used feature is searching by id. You can search by id using the .select() method and also using the .find() method. Let’s see an example of each.

Below is an example of finding an element by id using the select method. The id is written with a leading #, as in a CSS selector. As before, the select method will return a list containing all matching elements.

from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_elements = html_parser.select('#heading-1')
print(len(heading_elements))

Output:

1

The other methods that will help you to search by id are .find() and .find_all(). Both methods will scan the html document. However, the .find() method will stop scanning as soon as it finds a matching element and only returns that one element, while .find_all() will scan the whole document and return all matching elements. See how to use them in the examples below:

.find()

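from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_element = html_parser.find(id='heading-1')
# len() of a Tag counts its children (here the single text node), not the matches
print(len(heading_element))
print(type(heading_element))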

Output:

1
<class 'bs4.element.Tag'>

.find_all()

from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_elements = html_parser.find_all(id='heading-1')
print(len(heading_elements))
print(type(heading_elements))

Output:

1
<class 'bs4.element.ResultSet'>
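The difference between the two methods also shows up when nothing matches. The sketch below is my own illustration: it uses a one-line inline document and a made-up id, no-such-id, to show what each method gives back in that case.

```python
from bs4 import BeautifulSoup

# One-line inline document so the sketch is self-contained.
soup = BeautifulSoup('<h1 id="heading-1">Main heading</h1>', "html.parser")

found = soup.find(id="heading-1")               # a single Tag
not_found = soup.find(id="no-such-id")          # None when there is no match
all_not_found = soup.find_all(id="no-such-id")  # empty result when there is no match

print(found.text)          # Main heading
print(not_found)           # None
print(len(all_not_found))  # 0

# Because find() can return None, guard before reading from the result.
heading_text = found.text if found is not None else ""
```

Checking for None before using the result of .find() avoids an AttributeError when the page does not contain the element you expected.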

Beautifulsoup method: find_all

As seen in the previous section, the .find_all() method will scan the whole html document and return all matching elements. In case no matching element is found, it will return an empty result. You can use the find_all method to search by any html tag or by id.

from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
elements1 = html_parser.find_all(id='heading-1')  # by id
elements2 = html_parser.find_all('li')  # by tag
print(len(elements1))
print(len(elements2))
print(type(elements1))
print(type(elements2))

Output:

1
3
<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>

In both cases, searching by tag or by id, the find_all method will return a ResultSet object. You can access each item in the ResultSet using the item index. See the example below, where I read the text of one of the li html elements of our html document.

elements2 = html_parser.find_all('li')
print(elements2[0].text)

Output:

 1.First item
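Besides tags and ids, find_all can also filter by class through its class_ keyword argument, and the ResultSet it returns can be looped over like a list. The sketch below is a rough illustration of both, again with the list from the sample document inlined as a string; the variable names are my own.

```python
from bs4 import BeautifulSoup

# The list from the sample document, inlined so the sketch is self-contained.
html = """
<ul>
    <div class="alert alert-primary"><li> 1.First item</li></div>
    <div class="alert alert-primary"><li> 2.Second item</li></div>
    <div class="alert alert-primary"><li> 3.Third item</li></div>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# class_ (with a trailing underscore, since class is a python keyword)
# matches any element that carries that class.
alert_divs = soup.find_all("div", class_="alert-primary")

# A ResultSet can be iterated like a list.
item_texts = [li.text.strip() for li in soup.find_all("li")]

# limit caps how many matches are returned.
first_two = soup.find_all("li", limit=2)

print(len(alert_divs))  # 3
print(item_texts)       # ['1.First item', '2.Second item', '3.Third item']
print(len(first_two))   # 2
```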

Beautifulsoup method: Get Text

The last useful feature is accessing the text of an html element. Once you have found the relevant html element, getting the text will enable you to read the information and save it, so you can potentially analyse it later on.

The html elements in beautifulsoup are represented as tag objects. A beautifulsoup tag object has a text attribute, which contains the text of the element and of everything nested inside it. You can access this attribute directly or call the get_text() method. Both print the same result here. See the example below:

elements1 = html_parser.find_all(id='heading-1')
print(elements1[0].text)  # Main heading
print(elements1[0].get_text())  # Main heading
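The get_text() method also accepts a couple of optional arguments, separator and strip, that are handy for cleaning up whitespace. Here is a minimal sketch using two elements from the sample document, inlined as a string so it runs without sample.html; the variable names are my own.

```python
from bs4 import BeautifulSoup

# Two elements from the sample document, inlined so the sketch is self-contained.
html = """
<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
<li> 1.First item</li>
"""
soup = BeautifulSoup(html, "html.parser")

list_item = soup.find("li")
paragraph = soup.find("p")

# .text keeps the whitespace exactly as it appears in the document.
li_raw = list_item.text

# strip=True trims the whitespace around each piece of text.
li_clean = list_item.get_text(strip=True)

# When an element has nested tags, add a separator so the stripped
# pieces are not glued together.
p_clean = paragraph.get_text(separator=" ", strip=True)

print(repr(li_raw))  # ' 1.First item'
print(li_clean)      # 1.First item
print(p_clean)       # Be bold in stating your key points. Put them in a list:
```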

Conclusion

To summarise, this python tutorial covered how to install the beautifulsoup module and some commonly used operations, like finding by class, finding by id and accessing the text. These operations are used really often when working on web scraping. I hope you enjoyed this article, and thank you so much for reading and supporting this blog.
