Querying the DOM with XPath and lxml

XPath is a query language for selecting nodes from an XML document, and it is well worth learning for anyone who does web scraping. XPath offers several benefits over other selection tools:

  • It can easily navigate through the DOM tree
  • It is more expressive and powerful than alternatives such as CSS selectors and regular expressions
  • It has a large set (200+) of built-in functions, although lxml implements XPath 1.0, which provides only a smaller core subset, and it is extensible with custom functions
  • It is widely supported by parsing libraries and scraping platforms 
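
The points above can be sketched with lxml as follows. This is only a minimal illustration: the HTML fragment, the variable names, and the custom function is-absolute are hypothetical and were made up for the example. It shows tree navigation, two XPath 1.0 built-in functions (contains() and normalize-space()), and one custom function registered through lxml's FunctionNamespace.

```python
from lxml import etree, html

# Hypothetical HTML fragment, used only for illustration.
PAGE = """
<html>
  <body>
    <div id="nav">
      <a href="/home.html">Home</a>
      <a href="/about.html">About us</a>
    </div>
    <div id="content">
      <p class="intro">  Welcome to   the site  </p>
      <a href="https://example.com/docs.html">Docs</a>
    </div>
  </body>
</html>
"""

tree = html.fromstring(PAGE)

# Navigation: the href of every <a> that is a child of the div with id="nav".
nav_links = tree.xpath('//div[@id="nav"]/a/@href')

# Built-in functions: contains() filters on text content,
# normalize-space() trims and collapses whitespace.
about = tree.xpath('//a[contains(text(), "About")]/@href')
intro = tree.xpath('normalize-space(//p[@class="intro"])')


def is_absolute(context, urls):
    """Hypothetical custom function: true if any value is an http(s) URL."""
    # lxml passes a node-set argument such as @href as a list of strings.
    return any(u.startswith(("http://", "https://")) for u in urls)


# Register the function in the default (empty) function namespace.
# Note that this registration is global for the running process.
ns = etree.FunctionNamespace(None)
ns["is-absolute"] = is_absolute

absolute = tree.xpath('//a[is-absolute(@href)]/@href')

print(nav_links)
print(about)
print(intro)
print(absolute)
```

Registering the function globally keeps the sketch short. In a larger scraper it may be cleaner to register custom functions under a dedicated namespace so they cannot clash with other code.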

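XPath's data model defines seven kinds of nodes: root, element, attribute, text, namespace, processing-instruction, and comment nodes. We have seen some of them previously: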

  • root node (top level parent node)
  • element nodes (<a>..</a>)
  • attribute nodes (href="example.html")
  • text nodes ("this is a text")
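
As a rough sketch of how the node kinds listed above surface in lxml, the snippet below selects an element node, an attribute node, and a text node from a small document. The document content and variable names are assumptions chosen to mirror the examples in the list. The leading "/" in each expression starts the path at the root node.

```python
from lxml import etree

# Hypothetical document mirroring the examples in the list above.
DOC = '<catalog><a href="example.html">this is a text</a></catalog>'
root = etree.fromstring(DOC)

# Element nodes come back as lxml element objects.
elements = root.xpath('/catalog/a')

# Attribute nodes are selected with @name and come back as strings.
attrs = root.xpath('/catalog/a/@href')

# Text nodes are selected with text() and also come back as strings.
texts = root.xpath('/catalog/a/text()')

print(elements[0].tag)
print(attrs[0], attrs[0].is_attribute, attrs[0].attrname)
print(texts[0], texts[0].is_text)

# The returned strings remember where they came from in the tree.
print(attrs[0].getparent().tag)
```

The strings lxml returns for attribute and text nodes are "smart strings", which is why getparent() and the is_attribute and is_text flags are available on them.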
