Working With Beautiful Soup in Python


Beautiful Soup is a popular Python library for extracting data from HTML and XML files. It works with the parser of your choice (for example Python's built-in html.parser, lxml, or html5lib) and lets you navigate, search, and modify the parse tree.

In this series, we will first learn the basics of Beautiful Soup, and at the end we will work on a demo project.


Installing Beautiful Soup:-

To install Beautiful Soup, run the pip command below (it is the same on Windows, Linux, and macOS).

[code]pip install beautifulsoup4[/code]
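
Note that the package is installed as beautifulsoup4 but imported as bs4. A quick way to confirm the installation worked is to import it and print its version; this is only a small sanity check, and the exact version string will depend on your environment.

```python
# Quick sanity check that Beautiful Soup is importable after installation.
import bs4

# __version__ is a plain string; its exact value depends on what pip installed.
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)
```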

Download an HTML document and display it with Beautiful Soup:-
To download an HTML document and print it in a readable form, let's import BeautifulSoup and the urllib.request module into the project.

We are going to use the prettify() method to print the HTML document to the console.

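[python]
__author__ = 'WP8Dev'
from bs4 import BeautifulSoup
import urllib.request

def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    # Fetch the page; urlopen returns a file-like response object
    pageSource = urllib.request.urlopen(testUrl)
    # Name the parser explicitly so the result does not depend on what is installed
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    # prettify() returns the parse tree as an indented string
    print(soupPKG.prettify())

if __name__ == "__main__":
    main()
[/python]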

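Output:- the HTML of the page is printed to the console, with each tag on its own line and indented to show the nesting.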
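
If you want to try prettify() without a network connection, the same idea can be sketched on an inline HTML string. The markup below is made up for illustration and is not taken from the page above.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
html_doc = "<html><head><title>Demo</title></head><body><p>Hello <b>world</b></p></body></html>"

# Passing the parser name explicitly keeps the result independent of
# which optional parsers (lxml, html5lib) happen to be installed.
soup = BeautifulSoup(html_doc, "html.parser")

# prettify() returns a string; it does not print anything by itself.
pretty = soup.prettify()
print(pretty)
```

Because prettify() returns a string, you can also write it to a file instead of printing it.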

Basics of Beautiful Soup:-
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
But you’ll only ever have to deal with about four kinds of objects:
– Tag
– NavigableString
– BeautifulSoup
– Comment
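
To see the four kinds of objects side by side, here is a small sketch that parses a made-up snippet and prints the type of each node. The markup and variable names are only illustrative.

```python
from bs4 import BeautifulSoup, Comment, NavigableString, Tag

# Hypothetical markup containing a tag, a text node and a comment.
markup = "<p>Some text<!-- a comment --></p>"
soup = BeautifulSoup(markup, "html.parser")

tag = soup.p                    # the <p> element
text_node = tag.contents[0]     # the text inside <p>
comment_node = tag.contents[1]  # the HTML comment inside <p>

# Print the class name of each object in the tree.
for node in (soup, tag, text_node, comment_node):
    print(type(node).__name__)
```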

Tag:-
A Tag object corresponds to an XML or HTML tag in the original document:
For example:

[python]
from bs4 import BeautifulSoup
soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b
print(tag)
[/python]

Attributes of a tag: you can access a tag's attributes by treating the tag like a dictionary:-

Single attribute (class can hold several values, so this lookup returns the list ['boldest'] rather than a plain string)

[code]tag['class'][/code]

To get all the attributes as a dictionary

[code]tag.attrs[/code]

Multi-valued attributes

[python]
css_soup = BeautifulSoup('<p class="body strikeout"></p>', "html.parser")
css_soup.p['class']
# ['body', 'strikeout']
[/python]
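
The attribute access patterns above can be combined in one short sketch. The markup is hypothetical; it assumes a link with an id (single-valued), a class (multi-valued), and an href.

```python
from bs4 import BeautifulSoup

# Hypothetical markup, used only for illustration.
soup = BeautifulSoup(
    '<a id="home" class="nav active" href="/index.html">Home</a>',
    "html.parser",
)
link = soup.a

single = link["id"]          # single-valued attribute: a plain string
multi = link["class"]        # multi-valued attribute: a list of strings
missing = link.get("title")  # .get() returns None instead of raising KeyError
all_attrs = link.attrs       # every attribute, as a dictionary

# Attributes can also be added or changed with dictionary-style assignment.
link["title"] = "Go to the home page"

print(single, multi, missing)
print(all_attrs)
```

Using .get() is the safer choice when an attribute may be absent, since link["title"] on a tag without that attribute raises a KeyError.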

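NavigableString:-
A NavigableString is just like a Python string (str in Python 3), except that it also supports some of the features described in the "Navigating the tree" and "Searching the tree" sections of the Beautiful Soup documentation. You can convert it to a plain string with str().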

[python]
plain_string = str(tag.string)
plain_string
# 'Extremely bold'
[/python]
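
As a rough illustration of what "like a string, but aware of the tree" means, the sketch below uses ordinary string methods on a NavigableString, asks it for its parent tag, and then swaps it for new text with replace_with(). The variable names are only illustrative.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
nav_string = soup.b.string

# NavigableString subclasses str, so normal string methods work.
upper = nav_string.upper()

# Unlike a plain str, it also knows where it sits in the tree.
parent_name = nav_string.parent.name

# str() gives a plain string that no longer references the parse tree.
plain = str(nav_string)

# A string cannot be edited in place, but it can be replaced in the tree.
nav_string.replace_with("No longer bold")
updated_markup = str(soup.b)

print(upper, parent_name, plain)
print(updated_markup)
```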

Comments and other special strings:-
A Comment object is a special type of NavigableString that holds the text of an HTML comment.

[code]markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, "html.parser")
comment = soup.b.string
print(comment)[/code]
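
In a real page you usually want to find or strip every comment, not just one. Here is a minimal sketch of one way to do that, assuming a reasonably recent Beautiful Soup 4 release where find_all() accepts the string argument; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical markup with two comments around a paragraph.
markup = "<div><!-- navigation starts --><p>Visible text</p><!-- navigation ends --></div>"
soup = BeautifulSoup(markup, "html.parser")

# Collect every string node that is an instance of Comment.
comments = soup.find_all(string=lambda text: isinstance(text, Comment))
comment_texts = [c.strip() for c in comments]

# extract() removes a node from the tree, leaving only the real content.
for c in comments:
    c.extract()
cleaned = str(soup)

print(comment_texts)
print(cleaned)
```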

BeautifulSoup:-

The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object.

Let's build a simple program to get all the links on a page using find_all('a').

[python]
from bs4 import BeautifulSoup
import urllib.request

def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    pageSource = urllib.request.urlopen(testUrl)
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    #print(soupPKG.prettify())
    # find_all("a") returns every <a> tag in the document
    for link in soupPKG.find_all("a"):
        print(str(link))

if __name__ == "__main__":
    main()
[/python]
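
The program above prints each whole <a> tag. Often you only want the link text and the target URL. The sketch below shows one possible way to do that on an inline snippet, so it runs without network access; the markup is hypothetical, and urljoin from the standard library is used here to turn relative links into absolute ones.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Base URL taken from the example above; the markup itself is made up.
base_url = "http://scrolltest.com/about-us/"
html_doc = """
<a href="/contact/">Contact</a>
<a href="http://example.com/">External</a>
<a name="top">Anchor without href</a>
"""
soup = BeautifulSoup(html_doc, "html.parser")

links = []
# href=True keeps only the <a> tags that actually have an href attribute.
for a in soup.find_all("a", href=True):
    text = a.get_text(strip=True)
    absolute_url = urljoin(base_url, a["href"])
    links.append((text, absolute_url))

for text, url in links:
    print(text, "->", url)
```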

Now that we have the basics of Beautiful Soup, in the next tutorial we will learn more about its usage and create a demo project to scrape a website.
