October 13, 2024

How to Read Contents of PDF Using OCR in Python

Optical Character Recognition (OCR) converts scanned paper documents, image-only PDFs, and photos of text into editable, searchable text. To run OCR on a PDF in Python, you typically extract the images from the PDF and then apply OCR to them. Two commonly used libraries for this are PyMuPDF (to read the PDF and extract its images) and pytesseract (a Python wrapper around the Tesseract OCR engine). Here’s a step-by-step guide:

1. Installing Required Libraries

First, install the required Python libraries with pip: pytesseract, Pillow (for image processing), and PyMuPDF (for PDF processing):

pip install pytesseract pillow pymupdf

pytesseract is only a Python wrapper, so the Tesseract-OCR engine itself must be installed separately. Follow the instructions for your operating system in its [official GitHub repository](https://github.com/tesseract-ocr/tesseract).
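Because pytesseract calls the Tesseract executable, a quick check that the binary is discoverable can save debugging later. The sketch below uses only the standard library; the helper name `find_tesseract` is hypothetical and not part of pytesseract.

```python
import shutil
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the full path of the Tesseract binary, or None if it is not on PATH."""
    return shutil.which(executable)


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("Tesseract was not found on PATH; install it or add it to PATH.")
    else:
        print(f"Tesseract found at: {path}")
```

If Tesseract is installed somewhere outside PATH (a common situation on Windows), pytesseract can be pointed at it by assigning the full path to `pytesseract.pytesseract.tesseract_cmd`.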

2. Extracting Images from PDF

Use PyMuPDF to extract images from a PDF file. Here’s how you can do it:

import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path):
    """Return the raw bytes of every image embedded in the PDF."""
    images = []

    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as pdf_document:
        for page in pdf_document:
            # Each entry describes one embedded image; item 0 is its xref
            for img in page.get_images(full=True):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                images.append(base_image["image"])

    return images

pdf_path = "example.pdf"
images = extract_images_from_pdf(pdf_path)

This code opens a PDF file and extracts images from each page, returning a list of image bytes.
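One thing the loop above does not account for: an image that is reused on several pages (a logo, for instance) may be listed on each of those pages under the same xref, so it would be extracted and later OCR'd more than once. A minimal sketch of de-duplicating by xref follows. `unique_xrefs` is a hypothetical helper, and it assumes each entry is a tuple whose first item is the xref, as in the code above.

```python
from typing import Iterable, List, Sequence


def unique_xrefs(pages: Iterable[Iterable[Sequence]]) -> List[int]:
    """Collect image xrefs across pages, keeping only the first occurrence of each."""
    seen = set()
    ordered = []
    for image_list in pages:
        for img in image_list:
            xref = img[0]  # first item of each entry is the image's xref
            if xref not in seen:
                seen.add(xref)
                ordered.append(xref)
    return ordered


# Stand-in data shaped like per-page image lists: xref 7 appears on both pages
page_1 = [(7, 0, 640, 480), (9, 0, 100, 100)]
page_2 = [(7, 0, 640, 480), (12, 0, 200, 50)]
print(unique_xrefs([page_1, page_2]))  # [7, 9, 12]
```

In the extraction function, the per-page lists would come from `page.get_images(full=True)`, and each unique xref would then be passed to `extract_image`.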

3. Applying OCR to Extract Text from Images

Once you have extracted images from the PDF, use Pytesseract to perform OCR on these images:

from PIL import Image
import pytesseract
import io

def ocr_images(images):
    text = ""
    for image_bytes in images:
        image = Image.open(io.BytesIO(image_bytes))
        text += pytesseract.image_to_string(image)
    return text

# Perform OCR on the extracted images
text = ocr_images(images)
print(text)

In this code, each image's raw bytes are wrapped in an in-memory buffer (io.BytesIO) and opened as a Pillow image object, and pytesseract.image_to_string() extracts the text from that image. The text from all images is concatenated into a single string.
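Concatenating everything into one string makes it hard to tell which text came from which image. The variant below is a sketch under one assumption: the OCR step is passed in as a function (`recognize`, a hypothetical parameter), so the labelling logic can be tried without Tesseract installed.

```python
from typing import Callable, Iterable


def ocr_with_labels(images: Iterable[bytes], recognize: Callable[[bytes], str]) -> str:
    """Run `recognize` on each image and join the non-empty results with labels."""
    sections = []
    for index, image_bytes in enumerate(images, start=1):
        text = recognize(image_bytes).strip()
        if text:  # skip images in which no text was recognized
            sections.append(f"[image {index}]\n{text}")
    return "\n\n".join(sections)


# Demonstration with a stand-in recognizer that simply decodes the bytes
fake_images = [b"first page", b"   ", b"third page"]
print(ocr_with_labels(fake_images, lambda data: data.decode("ascii")))
```

With the libraries from this guide, `recognize` could be `lambda data: pytesseract.image_to_string(Image.open(io.BytesIO(data)))`.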

4. Putting It All Together

Here’s a complete example that combines image extraction and OCR:

import fitz
from PIL import Image
import pytesseract
import io

def extract_images_from_pdf(pdf_path):
    """Return the raw bytes of every image embedded in the PDF."""
    images = []

    # The context manager closes the document even if extraction fails
    with fitz.open(pdf_path) as pdf_document:
        for page in pdf_document:
            # Each entry describes one embedded image; item 0 is its xref
            for img in page.get_images(full=True):
                xref = img[0]
                base_image = pdf_document.extract_image(xref)
                images.append(base_image["image"])

    return images

def ocr_images(images):
    text = ""
    for image_bytes in images:
        image = Image.open(io.BytesIO(image_bytes))
        text += pytesseract.image_to_string(image)
    return text

# Path to the PDF file
pdf_path = "example.pdf"

# Extract images and perform OCR
images = extract_images_from_pdf(pdf_path)
text = ocr_images(images)
print(text)

5. Summary

To read the contents of a PDF using OCR in Python, extract the images embedded in the PDF with PyMuPDF, open each one as a Pillow image object, and pass it to pytesseract to recognize the text. pytesseract depends on the Tesseract-OCR engine, which must be installed separately.