October 13, 2024

PDF Handling in Python

Handling PDF files in Python covers tasks such as reading, writing, merging, and splitting documents, as well as extracting their text. Several libraries are available, each suited to a different subset of these tasks. Below are some popular libraries with examples of how to use them.

1. PyPDF2

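PyPDF2 is a widely used pure-Python library for manipulating PDF files. It allows you to extract text and metadata, merge or split documents, and add watermarks to pages. Note that PyPDF2 is no longer actively developed: the project continues under the name pypdf, which keeps the same PdfReader and PdfWriter interface used here.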

1.1. Installation

To install PyPDF2, use pip:

pip install PyPDF2
    

1.2. Basic Usage

Here’s an example of how to use PyPDF2 to merge two PDF files:

import PyPDF2

# Create a PDF writer object to collect the pages of both inputs
pdf_writer = PyPDF2.PdfWriter()

# Keep the input files open until the output has been written,
# because page contents are read from the source files lazily
with open('file1.pdf', 'rb') as pdf1, open('file2.pdf', 'rb') as pdf2:
    # Add every page of the first PDF, then every page of the second
    for pdf_file in (pdf1, pdf2):
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        for page in pdf_reader.pages:
            pdf_writer.add_page(page)

    # Write out the combined PDF
    with open('merged.pdf', 'wb') as output_pdf:
        pdf_writer.write(output_pdf)
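Splitting is the same process in reverse: read one file and write its pages out in groups. The following is a minimal sketch rather than a definitive implementation. The helper names chunk_ranges and split_pdf and the output file pattern are hypothetical, chosen for illustration, and the code assumes PyPDF2 3.x (the PdfReader and PdfWriter classes used in the merge example).

```python
def chunk_ranges(total_pages, chunk_size):
    """Return (start, stop) index pairs that cover total_pages in chunks."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(src_path, chunk_size, out_pattern="part_{:03d}.pdf"):
    """Write src_path out as several PDFs of at most chunk_size pages each."""
    # Imported here so chunk_ranges can be used without PyPDF2 installed
    from PyPDF2 import PdfReader, PdfWriter

    out_paths = []
    # The source stays open while writing, since pages are read lazily
    with open(src_path, "rb") as src:
        reader = PdfReader(src)
        ranges = chunk_ranges(len(reader.pages), chunk_size)
        for part_no, (start, stop) in enumerate(ranges, start=1):
            writer = PdfWriter()
            for index in range(start, stop):
                writer.add_page(reader.pages[index])
            out_path = out_pattern.format(part_no)
            with open(out_path, "wb") as out_file:
                writer.write(out_file)
            out_paths.append(out_path)
    return out_paths
```

With a five-page input, split_pdf('file1.pdf', 2) would write three files holding pages 1 and 2, pages 3 and 4, and page 5.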
    

2. PyMuPDF (fitz)

PyMuPDF is a Python binding for the MuPDF library and is imported under the module name fitz. It works with PDFs and other document formats, and in addition to page manipulation it supports text extraction and rendering pages to images.

2.1. Installation

To install PyMuPDF, use pip:

pip install PyMuPDF
    

2.2. Basic Usage

Here’s how to use PyMuPDF to extract text from a PDF:

import fitz  # PyMuPDF

# Open a PDF file
pdf_document = fitz.open('example.pdf')

# Extract text from each page
for page_num in range(len(pdf_document)):
    page = pdf_document.load_page(page_num)
    text = page.get_text()
    print(f"Page {page_num + 1} text:\n{text}\n")

# Close the PDF file
pdf_document.close()
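Rendering, the other feature mentioned above, can be sketched in a similar way. This is an illustrative example under stated assumptions: the function names zoom_for_dpi and render_pages and the output file pattern are hypothetical, and the target resolution of 144 dpi is an arbitrary choice. It relies on the fact that PDF coordinates use 72 points per inch, so a zoom factor of dpi / 72 gives the requested resolution.

```python
def zoom_for_dpi(dpi):
    """PDF user space is 72 points per inch, so the zoom factor is dpi / 72."""
    return dpi / 72.0


def render_pages(pdf_path, dpi=144, out_pattern="page_{:03d}.png"):
    """Render every page of pdf_path to a PNG file and return the paths."""
    import fitz  # PyMuPDF; imported here so zoom_for_dpi works without it

    zoom = zoom_for_dpi(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale x and y by the same factor
    out_paths = []
    # The document is closed automatically when the with block exits
    with fitz.open(pdf_path) as doc:
        for page_num, page in enumerate(doc, start=1):
            pixmap = page.get_pixmap(matrix=matrix)
            out_path = out_pattern.format(page_num)
            pixmap.save(out_path)
            out_paths.append(out_path)
    return out_paths
```

Calling render_pages('example.pdf') would write one PNG per page at twice the default size.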
    

3. PDFMiner

PDFMiner is designed for extracting text and metadata from PDF files. It is particularly useful for text extraction tasks and provides detailed control over the extraction process.

3.1. Installation

To install PDFMiner, use pip. The actively maintained version is published under the package name pdfminer.six:

pip install pdfminer.six
    

3.2. Basic Usage

Here’s an example of how to use PDFMiner to extract text from a PDF:

from pdfminer.high_level import extract_text

# Extract text from a PDF file
text = extract_text('example.pdf')
print(text)
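The finer control mentioned above comes mainly from two optional arguments of extract_text: page_numbers, which takes zero-based page indices, and laparams, which tunes layout analysis. The sketch below is one possible way to use them. The helper names to_zero_based and extract_selected_text are hypothetical, and the line_margin value shown is simply the library default, included to show where tuning would go.

```python
def to_zero_based(pages):
    """Convert 1-based page numbers to a sorted list of 0-based indices."""
    if any(p < 1 for p in pages):
        raise ValueError("page numbers start at 1")
    return sorted({p - 1 for p in pages})


def extract_selected_text(pdf_path, pages, line_margin=0.5):
    """Extract text from the given 1-based pages of pdf_path."""
    # Imported here so to_zero_based can be used without pdfminer.six
    from pdfminer.high_level import extract_text
    from pdfminer.layout import LAParams

    # line_margin controls how close lines must be to be grouped
    # into the same paragraph during layout analysis
    laparams = LAParams(line_margin=line_margin)
    return extract_text(
        pdf_path,
        page_numbers=to_zero_based(pages),
        laparams=laparams,
    )
```

For example, extract_selected_text('example.pdf', [1, 3]) would return the text of the first and third pages only.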
    

4. ReportLab

ReportLab is a powerful library for generating PDFs. It allows you to create new PDF documents with custom layouts and content, including text, graphics, and images.

4.1. Installation

To install ReportLab, use pip:

pip install reportlab
    

4.2. Basic Usage

Here’s how to create a simple PDF document with ReportLab:

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

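# Create a PDF file; coordinates are in points, measured from the bottom-left corner
c = canvas.Canvas('example.pdf', pagesize=letter)
width, height = letter  # 612 x 792 points

# Draw text and shapes
c.drawString(100, 750, 'Hello, PDF!')  # text starts at x=100, y=750
c.rect(100, 700, 200, 100)             # x, y, width, height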

# Save the PDF file
c.save()
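Longer documents need explicit page breaks, which the canvas provides through showPage(). The following is a minimal sketch, assuming plain text lines that fit within the page width. The function names paginate and write_lines_pdf, the one-inch margin, and the 14-point line spacing are hypothetical choices for illustration.

```python
def paginate(lines, lines_per_page):
    """Split a list of lines into consecutive pages of limited length."""
    if lines_per_page < 1:
        raise ValueError("lines_per_page must be at least 1")
    return [
        lines[i:i + lines_per_page]
        for i in range(0, len(lines), lines_per_page)
    ]


def write_lines_pdf(path, lines, margin=72, leading=14):
    """Write lines of text to a letter-size PDF, breaking pages as needed."""
    # Imported here so paginate can be used without ReportLab installed
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    width, height = letter
    lines_per_page = int((height - 2 * margin) // leading)
    c = canvas.Canvas(path, pagesize=letter)
    for page_lines in paginate(lines, lines_per_page):
        c.setFont("Helvetica", 11)  # set per page: showPage() resets the font
        y = height - margin         # start one margin below the top edge
        for line in page_lines:
            c.drawString(margin, y, line)
            y -= leading            # move down one line
        c.showPage()                # finish this page and start the next
    c.save()
```

With these defaults each page holds 46 lines, so a 100-line input would produce a three-page document.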
    

5. Conclusion

Python offers several libraries for handling PDF files, each with a different focus. PyPDF2 covers merging and splitting, PyMuPDF adds text extraction and rendering, PDFMiner gives detailed control over text extraction, and ReportLab generates new documents. Choose the one that best matches the task at hand.