Parsing HTML with lxml

Another powerful, fast, and flexible parser is the HTML parser that comes with lxml. lxml is an extensive library for parsing both XML and HTML documents, and its HTML parser can recover from broken markup such as unclosed or badly nested tags.

Let's start with an example.

Here, we will use the requests module to retrieve the web page and parse it with lxml:

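# Import the modules
import requests
from lxml import html

# Fetch the page and fail early on an HTTP error status
response = requests.get('http://packtpub.com/')
response.raise_for_status()

# Parse the raw response bytes into an element tree
tree = html.fromstring(response.content)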

Now the whole HTML document is stored in tree as an element tree that we can query in two different ways: XPath or CSS selectors. XPath is a query language for navigating the elements and attributes of structured documents such as HTML or XML.
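To make the two query styles concrete, here is a minimal sketch that runs the same lookup with XPath and with a CSS selector. The SAMPLE markup and its links are hypothetical, used in place of a live request so that the result is predictable. The cssselect() call is assumed to depend on the separate cssselect package, so it is guarded in case that package is not installed.

```python
from lxml import html

# Hypothetical sample page, used here instead of a live request
SAMPLE = """
<html>
  <body>
    <h1 class="title">Books</h1>
    <ul>
      <li><a href="/book/1">First book</a></li>
      <li><a href="/book/2">Second book</a></li>
    </ul>
  </body>
</html>
"""

tree = html.fromstring(SAMPLE)

# XPath: the text of every link inside a list item, then every href attribute
link_texts = tree.xpath('//li/a/text()')
link_hrefs = tree.xpath('//li/a/@href')

# CSS selector: the same links; this needs the separate cssselect package
try:
    css_links = [a.text_content() for a in tree.cssselect('li a')]
except ImportError:
    css_links = None  # cssselect is not installed

print(link_texts)
print(link_hrefs)
print(css_links)
```

XPath expressions can return text and attribute values directly, while cssselect() always returns elements, which is why the CSS version reads the text from each matched element.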

We can use any ...

Effective Python Penetration Testing by Rejah Rehim
