September 11, 2024

Working with PDF Files in Python

Reading, writing, and manipulating PDF documents is a common task in Python, and several libraries can help. Here’s an overview of popular libraries and how to use them:

1. PyPDF2

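PyPDF2 is a pure-Python library for reading and manipulating PDF files. It can extract text, merge PDFs, split PDFs, and more. Development has since continued under the name pypdf, so new projects may want to consider that package instead.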

Installing PyPDF2

pip install PyPDF2

Example: Extracting Text from a PDF

import PyPDF2

def extract_text_from_pdf(pdf_path):
    with open(pdf_path, 'rb') as file:
        # PdfReader replaces PdfFileReader, which was removed in PyPDF2 3.0
        reader = PyPDF2.PdfReader(file)
        text = ''
        # reader.pages replaces the removed numPages / getPage() API
        for page in reader.pages:
            text += page.extract_text()
    return text

pdf_path = 'example.pdf'
text = extract_text_from_pdf(pdf_path)
print(text)

Example: Merging PDF Files

from PyPDF2 import PdfMerger  # PdfFileMerger was removed in PyPDF2 3.0

def merge_pdfs(pdf_list, output_path):
    merger = PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)
    merger.write(output_path)
    merger.close()

pdf_files = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_files, 'merged.pdf')
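
Example: Splitting a PDF

Splitting is mentioned above but not shown, so here is a minimal sketch of one way to do it. It assumes PyPDF2 2.0 or later (PdfReader, PdfWriter, add_page). The names chunk_ranges and split_pdf, the pages_per_file parameter, and the output naming scheme are made up for this illustration and are not part of PyPDF2.

```python
def chunk_ranges(num_pages, pages_per_file):
    """Return (start, stop) page-index pairs that cover num_pages in order."""
    if pages_per_file < 1:
        raise ValueError("pages_per_file must be at least 1")
    return [
        (start, min(start + pages_per_file, num_pages))
        for start in range(0, num_pages, pages_per_file)
    ]


def split_pdf(pdf_path, output_prefix, pages_per_file=1):
    """Write consecutive chunks of pdf_path to <output_prefix>_<n>.pdf."""
    # Imported here so chunk_ranges stays usable without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    ranges = chunk_ranges(len(reader.pages), pages_per_file)
    output_paths = []
    for index, (start, stop) in enumerate(ranges, start=1):
        writer = PdfWriter()
        # Copy this chunk's pages into a fresh writer
        for page_number in range(start, stop):
            writer.add_page(reader.pages[page_number])
        output_path = f"{output_prefix}_{index}.pdf"  # hypothetical naming scheme
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
        output_paths.append(output_path)
    return output_paths
```

Calling split_pdf('example.pdf', 'part', pages_per_file=2) would write part_1.pdf, part_2.pdf, and so on, with the last file holding whatever pages remain.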

2. PyMuPDF (fitz)

PyMuPDF (traditionally imported as fitz) is a library for working with PDFs that offers text extraction, image extraction, and PDF manipulation.

Installing PyMuPDF

pip install PyMuPDF

Example: Extracting Text from a PDF

import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

pdf_path = 'example.pdf'
text = extract_text_from_pdf(pdf_path)
print(text)

Example: Extracting Images from a PDF

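import os

import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path, output_folder):
    # Create the output folder if it does not exist yet
    os.makedirs(output_folder, exist_ok=True)
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]  # cross-reference number of the image object
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's actual format (e.g. png, jpeg) instead of assuming PNG
            image_ext = base_image["ext"]
            image_name = f"page_{page_num+1}_img_{img_index+1}.{image_ext}"
            image_filename = os.path.join(output_folder, image_name)
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)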

pdf_path = 'example.pdf'
output_folder = 'images'
extract_images_from_pdf(pdf_path, output_folder)

3. ReportLab

ReportLab is a library for generating PDFs programmatically. It allows you to create PDFs with complex layouts and graphics.

Installing ReportLab

pip install reportlab

Example: Creating a Simple PDF

from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf(output_path):
    c = canvas.Canvas(output_path, pagesize=letter)
    c.drawString(100, 750, "Hello, World!")
    c.drawString(100, 735, "This is a PDF generated using ReportLab.")
    c.save()

output_path = 'hello.pdf'
create_pdf(output_path)
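
Example: Writing Text Across Multiple Pages

The simple example above places two strings at fixed coordinates. As a rough sketch of a slightly more involved layout, the code below writes a list of lines and starts a new page whenever the current one is full. The helper names paginate and create_multipage_pdf, and the margin and line_height defaults, are assumptions chosen for this illustration. Only canvas.Canvas, drawString, showPage, and save come from ReportLab itself.

```python
def paginate(lines, lines_per_page):
    """Group lines into consecutive pages of at most lines_per_page lines."""
    if lines_per_page < 1:
        raise ValueError("lines_per_page must be at least 1")
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


def create_multipage_pdf(output_path, lines, margin=72, line_height=15):
    """Draw each line of text top to bottom, adding pages as needed."""
    # Imported here so paginate stays usable without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    page_width, page_height = letter
    # Number of lines that fit between the top and bottom margins
    lines_per_page = int((page_height - 2 * margin) // line_height)

    c = canvas.Canvas(output_path, pagesize=letter)
    for page_lines in paginate(lines, lines_per_page):
        y = page_height - margin  # start just below the top margin
        for line in page_lines:
            c.drawString(margin, y, line)
            y -= line_height
        c.showPage()  # finish this page; later drawing goes on a new one
    c.save()
```

For instance, create_multipage_pdf('report.pdf', [f"Line {n}" for n in range(1, 101)]) would spread 100 lines over several letter-size pages. Note that drawString does not wrap long lines, so this sketch assumes each line already fits within the page width.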

4. Summary

Python offers several libraries for working with PDF files, each with its own strengths:

  • PyPDF2: Good for reading, merging, and splitting PDFs.
  • PyMuPDF (fitz): Offers text extraction, image extraction, and PDF manipulation.
  • ReportLab: Useful for generating PDF documents with complex layouts and graphics.

Choose the library based on your specific needs for PDF processing or creation.