Reading, writing, and manipulating PDF documents is a common task in Python, and several libraries can help with it. Here’s an overview of three popular libraries and how to use them:
1. PyPDF2
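PyPDF2 is a pure-Python library for reading and manipulating PDF files. It can extract text, merge PDFs, split PDFs, and more. Development of the project has since continued under the name pypdf, and PyPDF2 3.x is the last release line published under the PyPDF2 name.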
Installing PyPDF2
pip install PyPDF2
Example: Extracting Text from a PDF
import PyPDF2

def extract_text_from_pdf(pdf_path):
    # PyPDF2 3.x API: PdfReader and reader.pages replace the removed
    # PdfFileReader, numPages, and getPage names
    with open(pdf_path, 'rb') as file:
        reader = PyPDF2.PdfReader(file)
        text = ''
        for page in reader.pages:
            text += page.extract_text()
    return text

pdf_path = 'example.pdf'
text = extract_text_from_pdf(pdf_path)
print(text)
Example: Merging PDF Files
from PyPDF2 import PdfMerger  # named PdfFileMerger before PyPDF2 3.0

def merge_pdfs(pdf_list, output_path):
    merger = PdfMerger()
    for pdf in pdf_list:
        merger.append(pdf)  # append every page of each input file, in order
    merger.write(output_path)
    merger.close()

pdf_files = ['file1.pdf', 'file2.pdf']
merge_pdfs(pdf_files, 'merged.pdf')
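Example: Splitting a PDF
Splitting is mentioned above but not shown, so here is a minimal sketch of one way to do it, assuming the PyPDF2 3.x names PdfReader and PdfWriter. The helper names page_chunks and split_pdf, the chunk_size parameter, and the output naming scheme are illustrative choices, not part of PyPDF2.

```python
def page_chunks(total_pages, chunk_size):
    """Return (start, stop) index pairs that cover total_pages in order."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    return [
        (start, min(start + chunk_size, total_pages))
        for start in range(0, total_pages, chunk_size)
    ]


def split_pdf(pdf_path, output_prefix, chunk_size=1):
    """Write consecutive chunk_size-page slices of pdf_path to new files."""
    # Imported here so page_chunks can be used without PyPDF2 installed.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(pdf_path)
    chunks = page_chunks(len(reader.pages), chunk_size)
    output_paths = []
    for part, (start, stop) in enumerate(chunks, start=1):
        writer = PdfWriter()
        for index in range(start, stop):
            writer.add_page(reader.pages[index])
        # Hypothetical naming scheme: <prefix>_1.pdf, <prefix>_2.pdf, ...
        output_path = f"{output_prefix}_{part}.pdf"
        with open(output_path, "wb") as output_file:
            writer.write(output_file)
        output_paths.append(output_path)
    return output_paths
```

Calling split_pdf('example.pdf', 'part', chunk_size=2) on a five-page file would be expected to write three files, the last one holding the single remaining page.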
2. PyMuPDF (fitz)
PyMuPDF (imported in code as fitz) is a library for working with PDFs that offers text extraction, image extraction, and PDF manipulation.
Installing PyMuPDF
pip install PyMuPDF
Example: Extracting Text from a PDF
import fitz  # PyMuPDF

def extract_text_from_pdf(pdf_path):
    doc = fitz.open(pdf_path)
    text = ''
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        text += page.get_text()
    return text

pdf_path = 'example.pdf'
text = extract_text_from_pdf(pdf_path)
print(text)
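The function above returns one long string, which loses the page boundaries. If you need to know which page a passage came from, a variant like the following keeps one string per page. This is only a sketch: the names extract_text_by_page and format_pages and the "--- Page N ---" header format are illustrative assumptions, not part of PyMuPDF.

```python
def format_pages(page_texts):
    """Join per-page strings, prefixing each with a 1-based page header."""
    sections = []
    for number, text in enumerate(page_texts, start=1):
        sections.append(f"--- Page {number} ---\n{text.strip()}")
    return "\n\n".join(sections)


def extract_text_by_page(pdf_path):
    """Return a list with one text string per page of the PDF."""
    # Imported here so format_pages can be used without PyMuPDF installed.
    import fitz  # PyMuPDF

    # The document is closed automatically when the with-block exits.
    with fitz.open(pdf_path) as doc:
        return [page.get_text() for page in doc]
```

A typical call would be print(format_pages(extract_text_by_page('example.pdf'))).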
Example: Extracting Images from a PDF
import os
import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path, output_folder):
    os.makedirs(output_folder, exist_ok=True)  # create the output folder if needed
    doc = fitz.open(pdf_path)
    for page_num in range(len(doc)):
        page = doc.load_page(page_num)
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]  # cross-reference number of the image object
            base_image = doc.extract_image(xref)
            image_bytes = base_image["image"]
            # Use the image's actual format (e.g. png, jpeg) instead of assuming PNG
            image_ext = base_image["ext"]
            image_filename = f"{output_folder}/page_{page_num+1}_img_{img_index+1}.{image_ext}"
            with open(image_filename, "wb") as img_file:
                img_file.write(image_bytes)

pdf_path = 'example.pdf'
output_folder = 'images'
extract_images_from_pdf(pdf_path, output_folder)
3. ReportLab
ReportLab is a library for generating PDFs programmatically. It allows you to create PDFs with complex layouts and graphics.
Installing ReportLab
pip install reportlab
Example: Creating a Simple PDF
from reportlab.lib.pagesizes import letter
from reportlab.pdfgen import canvas

def create_pdf(output_path):
    c = canvas.Canvas(output_path, pagesize=letter)
    # Coordinates are in points, measured from the bottom-left corner of the page
    c.drawString(100, 750, "Hello, World!")
    c.drawString(100, 735, "This is a PDF generated using ReportLab.")
    c.save()

output_path = 'hello.pdf'
create_pdf(output_path)
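Example: Writing Text Across Multiple Pages
The canvas does not wrap or paginate text on its own, so longer output needs explicit page breaks. The sketch below shows one simple approach using showPage(). The helper names paginate and create_multipage_pdf, and the margin and line-height values, are illustrative assumptions chosen to match the coordinates used above.

```python
def paginate(lines, lines_per_page):
    """Group lines into consecutive pages of at most lines_per_page each."""
    if lines_per_page < 1:
        raise ValueError("lines_per_page must be at least 1")
    return [
        lines[start:start + lines_per_page]
        for start in range(0, len(lines), lines_per_page)
    ]


def create_multipage_pdf(output_path, lines, top=750, bottom=72, line_height=15):
    """Draw one string per line, starting a new page when the current one is full."""
    # Imported here so paginate can be used without ReportLab installed.
    from reportlab.lib.pagesizes import letter
    from reportlab.pdfgen import canvas

    # Number of baselines that fit between the top and bottom positions.
    lines_per_page = (top - bottom) // line_height + 1
    c = canvas.Canvas(output_path, pagesize=letter)
    for page_lines in paginate(lines, lines_per_page):
        y = top
        for line in page_lines:
            c.drawString(100, y, line)
            y -= line_height
        c.showPage()  # finish the current page and start a new one
    c.save()
```

For example, create_multipage_pdf('report.pdf', [f"Line {n}" for n in range(1, 101)]) would spread 100 lines over three pages with these default values. For real documents with flowing paragraphs, ReportLab's higher-level platypus module is usually a better fit than manual positioning.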
4. Summary
Python offers several libraries for working with PDF files, each with its unique capabilities:
- PyPDF2: Good for reading, merging, and splitting PDFs.
- PyMuPDF (fitz): Offers advanced functionalities like text extraction and image extraction.
- ReportLab: Useful for generating new PDF documents from scratch, including ones with complex layouts and graphics.
Choose the library based on your specific needs for PDF processing or creation.