October 13, 2024

Python lxml Module

The lxml module in Python is a powerful and feature-rich library for processing XML and HTML documents. It provides a comprehensive API for parsing, creating, and manipulating XML and HTML data. It is widely used due to its performance, ease of use, and extensive functionality.

1. Installation

To use the lxml module, you need to install it. You can do so using pip:

pip install lxml
    

2. Basic Usage

Here are some basic operations you can perform with the lxml module:

Parsing XML

from lxml import etree

# Parse XML from a string
xml_string = '''
    Content 1
    Content 2
'''

root = etree.fromstring(xml_string)

# Print the XML structure
print(etree.tostring(root, pretty_print=True).decode())
    

Creating XML

from lxml import etree

# Create XML elements
root = etree.Element("root")
child1 = etree.SubElement(root, "child", name="child1")
child1.text = "Content 1"
child2 = etree.SubElement(root, "child", name="child2")
child2.text = "Content 2"

# Convert to string
xml_str = etree.tostring(root, pretty_print=True).decode()
print(xml_str)
    

Parsing HTML

from lxml import html

# Parse HTML from a string
html_string = '''
    
    Title
        

This is a paragraph.

''' tree = html.fromstring(html_string) # Extract content title = tree.xpath('//h1/text()')[0] print("Title:", title)

3. XPath Queries

The lxml module supports XPath, which allows you to query and navigate XML and HTML documents. Here’s an example:

from lxml import etree

# Parse XML
xml_string = '''
    Item 1
    Item 2
'''
root = etree.fromstring(xml_string)

# XPath query
items = root.xpath('//item/text()')
print("Items:", items)
    

4. XSLT Transformations

With lxml, you can also perform XSLT transformations:

from lxml import etree

# Parse XML and XSLT
xml_string = '''
    Item 1
    Item 2
'''

xslt_string = '''
    
        
            
                

Items

''' xml_root = etree.fromstring(xml_string) xslt_root = etree.fromstring(xslt_string) transform = etree.XSLT(xslt_root) result = transform(xml_root) print(result)

5. Conclusion

The lxml module is a versatile and powerful tool for working with XML and HTML data in Python. Its support for XPath and XSLT, along with its ability to handle large documents efficiently, makes it an excellent choice for many applications involving structured data.