Posted by Marta on December 18, 2020
In this tutorial I will show you how to install the beautifulsoup library in Python and use some of its most popular methods, such as find by class, find_all and find by id. This library is really useful when you are extracting information from a website, a task also known as web scraping.
Beautifulsoup is a Python module that helps parse HTML and extract information from an HTML document. It provides functionality to navigate, search and extract information from an HTML or XML file. See the official documentation for more details.
If you would like to use the beautifulsoup module in a Python program, you should start by installing the library. Beautifulsoup is a third-party library, meaning it is not a built-in library installed along with Python. The simplest way to install it is using pip. Here is the command you need to run from your terminal:
pip install beautifulsoup4
What if you want to work in a virtual environment? You will need to create and activate the virtual environment first and then install the library. To do so, execute the following commands from your terminal:
python3 -m venv myvirtualenv
source myvirtualenv/bin/activate
pip install beautifulsoup4
The first line will create a folder called myvirtualenv that will contain a Python virtual environment. Note that you can replace myvirtualenv with any folder name of your choice, as long as you use the same name in the activate command.
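To confirm the installation worked, one quick check is to import the package and print its version. This is only a minimal sketch; it assumes you run it with the same interpreter (or activated virtual environment) that pip installed into. Note that the package is installed as beautifulsoup4 but imported as bs4.

```python
# Quick sanity check: this import raises ModuleNotFoundError
# if beautifulsoup4 is not installed for the current interpreter.
import bs4

installed_version = bs4.__version__
print("beautifulsoup4 version:", installed_version)
```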
Here is the HTML document that I will be using in this tutorial:
<html>
<head>
  <title>How to scrape with beautifulsoup</title>
  <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
</head>
<!-- The information between the BODY and /BODY tags is displayed.-->
<body>
  <div class="container">
    <div class="header">
      <h1 id="heading-1">Main heading</h1>
    </div>
    <p>Be <b>bold</b> in stating your key points. Put them in a list: </p>
    <ul>
      <div class="alert alert-primary"><li> 1.First item</li></div>
      <div class="alert alert-primary"><li> 2.Second item</li></div>
      <div class="alert alert-primary"><li> 3.Third item</li></div>
    </ul>
    <p>Improve your image by including an image. </p>
    <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
    <p>Find out all about python web scraping with <a href="https://www.google.com/">Google</a>.</p><hr>
    <p>Some more stuff here: <a href="page2.html">another page.</a></p>
  </div>
</body>
</html>
This document has a heading, some paragraphs and a list. If you paste this HTML into a file on your computer and open it with a browser, you should see the rendered page. For this example, I will name the file sample.html.
Let’s see how we can scrape information out of the previous document. First you need to navigate to the specific element that holds the information you want to extract. Next you need to read the information, either text or an attribute, from the element. Usually you will reach an element of the HTML document by searching by class or id.
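The two steps described above, navigating to an element and then reading its text or an attribute, can be sketched as follows. This is only an illustration: the inline html string is an assumption, a cut-down copy of part of sample.html so that the snippet runs without the file on disk.

```python
from bs4 import BeautifulSoup

# Inline stand-in for part of sample.html (an assumption, so the
# snippet does not depend on a file being present).
html = """
<div class="container">
  <h1 id="heading-1">Main heading</h1>
  <p><img src="http://www.mygifs.com/CoverImage.gif" alt="A Great HTML Resource"></p>
  <p>Find out more with <a href="https://www.google.com/">Google</a>.</p>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')

# Step 1: navigate to the elements that hold the information.
link = soup.find('a')
image = soup.find('img')

# Step 2: read the text or an attribute from each element.
link_text = link.text          # text between the opening and closing tags
link_target = link['href']     # raises KeyError if the attribute is missing
image_alt = image.get('alt')   # returns None if the attribute is missing

print(link_text)     # Google
print(link_target)   # https://www.google.com/
print(image_alt)     # A Great HTML Resource
```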
Beautifulsoup allows you to search for an HTML element using its class. You can search by class using the .select() method, passing a CSS class selector as the argument. This method will run a CSS selector against the parsed document and return all matching elements. See the example below:
from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
li_elements = html_parser.select('.alert-primary')
print(len(li_elements))
Output:
3
The code above first loads the HTML document from a file (line 3). In line 4 it creates a BeautifulSoup object, which makes the document easy to navigate and search. Last, in line 5, it searches for the elements that have class='alert-primary'. The code prints 3, since there are three elements with the class alert-primary.
Keep in mind that the .select() method always returns a list, even when only one element matches. Therefore, if you would like to access any of the resulting items, you will need to access it by index first.
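As a small sketch of that index access, the snippet below picks the first and last match from the list and also shows what happens when nothing matches. The inline html string and the class name does-not-exist are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Inline copy of the list from sample.html (an assumption, so the
# snippet runs without the file).
html = """
<ul>
  <div class="alert alert-primary"><li> 1.First item</li></div>
  <div class="alert alert-primary"><li> 2.Second item</li></div>
  <div class="alert alert-primary"><li> 3.Third item</li></div>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

matches = soup.select('.alert-primary')

# Guard against an empty list before indexing into it.
if matches:
    first_text = matches[0].text.strip()
    last_text = matches[-1].text.strip()
    print(first_text)   # 1.First item
    print(last_text)    # 3.Third item

# A selector that matches nothing returns an empty list, not None.
no_matches = soup.select('.does-not-exist')
print(len(no_matches))  # 0
```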
We have seen how to search by class. Another handy and commonly used feature is searching by id. You can search by id using the .select() method and also using the .find() method. Let’s see an example of each.
Below there is an example of finding an element by id using the select method. As before, the select method will return a list containing all matching elements.
from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_elements = html_parser.select('#heading-1')
print(len(heading_elements))
Output:
1
The other methods that will help you search by id are .find() and .find_all(). Both methods scan the HTML document; however, the .find() method stops scanning as soon as it finds a matching element and returns only that element. The .find_all() method scans the whole document and returns all matching elements. See how to use them in the examples below:
.find()
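from bs4 import BeautifulSoup
html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_element = html_parser.find(id='heading-1')
print(len(heading_element))  # number of children of the tag, not number of matches
print(type(heading_element))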
Output:
1
<class 'bs4.element.Tag'>
.find_all()
from bs4 import BeautifulSoup
html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
heading_elements = html_parser.find_all(id='heading-1')
print(len(heading_elements))
print(type(heading_elements))
Output:
1
<class 'bs4.element.ResultSet'>
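The two methods also differ when nothing matches: .find() returns None, while .find_all() returns an empty ResultSet. The sketch below illustrates this on an inline snippet of the sample document; the id no-such-id is a made-up value used only to force a miss.

```python
from bs4 import BeautifulSoup

# Inline fragment of sample.html (an assumption, so no file is needed).
html = '<div class="header"><h1 id="heading-1">Main heading</h1></div>'
soup = BeautifulSoup(html, 'html.parser')

# 'no-such-id' is a hypothetical id that does not exist in the document.
missing_one = soup.find(id='no-such-id')
missing_all = soup.find_all(id='no-such-id')

print(missing_one is None)   # True
print(len(missing_all))      # 0

# Check for None before reading .text, otherwise a miss raises AttributeError.
heading = soup.find(id='heading-1')
heading_text = heading.text if heading is not None else ''
print(heading_text)          # Main heading
```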
As seen in the previous section, the .find_all() method will scan the whole HTML document and return all matching elements. If no matching element is found, it will return an empty ResultSet. You can use the find_all method to search by any HTML tag or by id.
from bs4 import BeautifulSoup

html_file = open('./sample.html')
html_parser = BeautifulSoup(html_file, 'html.parser')
elements1 = html_parser.find_all(id='heading-1')  # by id
elements2 = html_parser.find_all('li')            # by tag
print(len(elements1))
print(len(elements2))
print(type(elements1))
print(type(elements2))
Output:
1
3
<class 'bs4.element.ResultSet'>
<class 'bs4.element.ResultSet'>
In both cases, searching by tag or by id, the find_all method returns a ResultSet object. You can access each item in the ResultSet using its index. See the example below, where I read the text of one of the li elements of our HTML document.
elements2 = html_parser.find_all('li')
print(elements2[0].text)
Output:
1.First item
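Besides indexing, a ResultSet can be looped over like any list. As a minimal sketch, the snippet below collects the text of every li element; the inline html string is an assumption that mirrors the list in sample.html, and strip() is used to drop the leading space inside each item.

```python
from bs4 import BeautifulSoup

# Inline copy of the list from sample.html (an assumption, so the
# snippet runs without the file).
html = """
<ul>
  <div class="alert alert-primary"><li> 1.First item</li></div>
  <div class="alert alert-primary"><li> 2.Second item</li></div>
  <div class="alert alert-primary"><li> 3.Third item</li></div>
</ul>
"""
soup = BeautifulSoup(html, 'html.parser')

# Iterate over the ResultSet and clean up the text of each item.
item_texts = [li.text.strip() for li in soup.find_all('li')]
print(item_texts)  # ['1.First item', '2.Second item', '3.Third item']
```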
The last useful feature is accessing the text of an HTML element. Once you have found the relevant HTML element, getting its text enables you to read the information and save it, so you can analyse it later on.
The HTML elements in beautifulsoup are represented as Tag objects. A beautifulsoup Tag object has a text attribute, which contains the text of the element and of its children. You can access this attribute directly or via the get_text() method. See the example below:
elements1 = html_parser.find_all(id='heading-1')
print(elements1[0].text)
print(elements1[0].get_text())
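When an element contains nested tags or stray whitespace, get_text() also accepts separator and strip arguments that can tidy up the result. The following is a small sketch on the first paragraph of the sample document, copied inline here as an assumption so that it runs on its own.

```python
from bs4 import BeautifulSoup

# First paragraph of sample.html, copied inline (an assumption).
html = '<p>Be <b>bold</b> in stating your key points. Put them in a list: </p>'
soup = BeautifulSoup(html, 'html.parser')
paragraph = soup.find('p')

# Default behaviour: text of the element and its children, whitespace kept.
raw_text = paragraph.get_text()

# strip=True trims each piece of text; separator=' ' joins the pieces
# with a single space so that words do not run together.
clean_text = paragraph.get_text(separator=' ', strip=True)

print(repr(raw_text))    # 'Be bold in stating your key points. Put them in a list: '
print(repr(clean_text))  # 'Be bold in stating your key points. Put them in a list:'
```

Passing strip=True without a separator would join the trimmed pieces directly, so "Be" and "bold" would run together. That is why the sketch sets both arguments.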
To summarise, this Python tutorial covered how to install the beautifulsoup module and some commonly used operations such as finding by class, finding by id and accessing the text. These operations are used really often when working on web scraping. I hope you enjoyed this article, and thank you so much for reading and supporting this blog.