Beautiful Soup is a popular Python library for extracting data from HTML and XML files. It works with the parser of your choice (Python's built-in html.parser, lxml, or html5lib) and gives you simple ways of navigating, searching and modifying the parse tree.
In this series, we will first learn the basics of Beautiful Soup, and at the end we will work on a demo project.
Installing Beautiful Soup:-
To install Beautiful Soup on a Windows machine, use the pip command below.
[code]pip install beautifulsoup4[/code]
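To confirm that the installation worked, a quick check along these lines can help. This is only a minimal sketch: it assumes the package imports under the name bs4 and uses Python's built-in html.parser, and the variable names are made up for illustration.

```python
import bs4
from bs4 import BeautifulSoup

# Version string of the installed package
installed_version = bs4.__version__
print("Beautiful Soup version:", installed_version)

# Parse a tiny document with the parser that ships with Python
soup = BeautifulSoup("<p>It works</p>", "html.parser")
paragraph_text = soup.p.string
print(paragraph_text)
```

If the import fails, pip most likely installed the package for a different Python interpreter than the one running the script.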
Download an HTML document and display it with Beautiful Soup:-
To download an HTML document and view it nicely formatted, let's import Beautiful Soup and the urllib module into the project.
[python]
from bs4 import BeautifulSoup
import urllib.request
[/python]
We are going to use the prettify() method to print the HTML document to the console.
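[python]
__author__ = 'WP8Dev'

from bs4 import BeautifulSoup
import urllib.request


def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    # Download the page; urlopen returns a file-like response object
    pageSource = urllib.request.urlopen(testUrl)
    # Name the parser explicitly so the result is the same on every machine
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    # prettify() returns the parse tree as an indented string
    print(soupPKG.prettify())


if __name__ == "__main__":
    main()
[/python]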
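If the site is not reachable, the same idea can be tried offline. The sketch below feeds a made-up HTML string (sample_html is an assumption, not the real page) to Beautiful Soup and prints the prettified result, with each tag on its own indented line.

```python
from bs4 import BeautifulSoup

# A made-up snippet standing in for a downloaded page
sample_html = "<html><body><h1>About us</h1><p>Hello <b>world</b></p></body></html>"

# Parse with Python's built-in parser
soup = BeautifulSoup(sample_html, "html.parser")

# prettify() returns a string, so it can be printed or saved to a file
pretty = soup.prettify()
print(pretty)
```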
Basics of Beautiful Soup:-
Beautiful Soup transforms a complex HTML document into a complex tree of Python objects.
But you’ll only ever have to deal with about four kinds of objects:
– Tag
– NavigableString
– BeautifulSoup
– Comment
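To see the four kinds of objects side by side, something like the following can be used. It is a small sketch on a made-up one-line document; the names sample_html, text_node and comment_node are only illustrative.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment, NavigableString, Tag

# A made-up document containing a tag, some text and a comment
sample_html = '<p class="intro">Hello<!-- a note --></p>'
soup = BeautifulSoup(sample_html, "html.parser")

paragraph = soup.p
# The paragraph has two children: the text and the comment
text_node, comment_node = list(paragraph.children)

for label, obj in [
    ("soup", soup),
    ("paragraph", paragraph),
    ("text_node", text_node),
    ("comment_node", comment_node),
]:
    print(label, "->", type(obj).__name__)
```

Note that Comment is a special kind of NavigableString, so an isinstance check against NavigableString is true for both text and comments.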
Tag:-
A Tag object corresponds to an XML or HTML tag in the original document:
e.g.
[python]
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', 'html.parser')
tag = soup.b
print(tag)
[/python]
Getting the attributes of a tag:-
You can access a tag's attributes by treating the tag like a dictionary.
A single attribute
[code]tag['class'][/code]
All attributes
[code]tag.attrs[/code]
Multi-valued attributes
[python]
css_soup = BeautifulSoup('<p class="body strikeout"></p>', 'html.parser')
css_soup.p['class']
# ['body', 'strikeout']
[/python]
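The attribute access described above can be put together in one short sketch. The markup is made up for illustration; it shows a single-valued attribute (id), a multi-valued one (class), a safe lookup with get(), and how attributes can be changed or removed.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p id="intro" class="body strikeout">text</p>', "html.parser")
p = soup.p

single_value = p["id"]        # a plain string
multi_value = p["class"]      # a list, because class can hold several values
missing = p.get("title")      # get() returns None instead of raising KeyError
all_attrs = dict(p.attrs)     # copy of the attribute dictionary

# Attributes can be modified or deleted like dictionary entries
p["id"] = "summary"
del p["class"]
print(p)
```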
NavigableString:-
A NavigableString is just like a Python string, except that it also supports some of the features used for navigating and searching the tree. You can convert it to a plain Python string with str().
[python]unicode_string = str(tag.string)
unicode_string
# 'Extremely bold'[/python]
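Here is a slightly fuller sketch of working with a NavigableString, using the same bold tag as before. It shows that the string still knows its parent tag, and that it cannot be edited in place but can be swapped out with replace_with().

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<b class="boldest">Extremely bold</b>', "html.parser")
tag = soup.b

nav_string = tag.string
plain_text = str(nav_string)            # plain Python string, detached from the tree
parent_name = nav_string.parent.name    # the string knows which tag contains it

# Replace the text inside the tag with a new string
nav_string.replace_with("No longer bold")
print(tag)
```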
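Comments and other special strings:-
A Comment object is a special type of NavigableString that holds the text of an HTML comment.
[code]markup = "<b><!--Hey, buddy. Want to buy a used parser?--></b>"
soup = BeautifulSoup(markup, 'html.parser')
comment = soup.b.string
print(comment)[/code]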
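When a page contains several comments, one way to collect them is to search for strings that are Comment instances. This is a sketch on made-up markup; the lambda passed to find_all is just one possible filter.

```python
from bs4 import BeautifulSoup
from bs4.element import Comment

# Made-up markup with two comments around a paragraph
markup = "<div><!--first note--><p>Visible text</p><!--second note--></div>"
soup = BeautifulSoup(markup, "html.parser")

# Keep only the strings that are comments
comments = soup.find_all(string=lambda s: isinstance(s, Comment))
comment_texts = [str(c) for c in comments]
print(comment_texts)

# get_text() leaves comments out and returns only the regular text
visible_text = soup.get_text()
print(visible_text)
```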
BeautifulSoup:-
The BeautifulSoup object itself represents the document as a whole. For most purposes, you can treat it as a Tag object.
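A short sketch of what treating it as a Tag means in practice, on a made-up document: the object has a name (the special value "[document]"), children, and the same search methods as any tag.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<html><body><p>One</p><p>Two</p></body></html>", "html.parser")

document_name = soup.name                    # special name given to the whole document
first_child_name = soup.contents[0].name     # the top-level <html> tag
paragraphs = soup.find_all("p")              # search methods work as on any Tag

print(document_name, first_child_name, len(paragraphs))
```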
Let's build a simple program to get all the links on the page using find_all('a').
[python]
from bs4 import BeautifulSoup
import urllib.request


def main():
    print("***********")
    testUrl = "http://scrolltest.com/about-us/"
    pageSource = urllib.request.urlopen(testUrl)
    soupPKG = BeautifulSoup(pageSource, "html.parser")
    # print(soupPKG.prettify())
    # find_all("a") returns every <a> tag in the document
    for link in soupPKG.find_all("a"):
        print(str(link))


if __name__ == "__main__":
    main()
[/python]
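The program above prints whole a tags. Often only the link target is wanted, so here is a sketch that pulls out each href and turns relative paths into absolute URLs. The HTML snippet is made up and stands in for the downloaded page; base_url reuses the address from the example.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "http://scrolltest.com/about-us/"

# Made-up markup standing in for the downloaded page
sample_html = """
<ul>
  <li><a href="/contact/">Contact</a></li>
  <li><a href="http://example.com/">Example</a></li>
  <li><a name="top">No href here</a></li>
</ul>
"""

soup = BeautifulSoup(sample_html, "html.parser")

links = []
for link in soup.find_all("a"):
    href = link.get("href")   # None when the tag has no href attribute
    if href is None:
        continue
    # Resolve relative paths against the page address
    links.append((link.get_text(strip=True), urljoin(base_url, href)))

print(links)
```

Using link.get("href") rather than link["href"] avoids a KeyError on anchors that have no href.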
Now we have the basics of Beautiful Soup. In the next tutorial we will learn more about Beautiful Soup usage and create a demo project to scrape a website.