Optical Character Recognition (OCR) is a technology that converts documents such as scanned paper documents, PDFs, or images into editable and searchable data. To perform OCR on a PDF file in Python, you typically extract images from the PDF and then apply OCR to those images. The libraries commonly used for this task are PyMuPDF (for PDF extraction) and Pytesseract (for OCR). Here’s a step-by-step guide:
1. Installing Required Libraries
First, install the required libraries. Use pip to install pytesseract, Pillow (for image processing), and PyMuPDF (for PDF processing). You will also need the Tesseract-OCR engine, a separate program that pytesseract calls to do the actual recognition. Install the Python packages first:
pip install pytesseract pillow pymupdf
To install Tesseract-OCR, follow the instructions on its [official GitHub repository](https://github.com/tesseract-ocr/tesseract) based on your operating system.
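Before moving on, it can help to confirm that the Tesseract executable is actually reachable, since pytesseract only wraps it. The check below is a minimal sketch that assumes the binary is named tesseract and is on your PATH; the helper name find_tesseract is made up for this illustration.

```python
import shutil

def find_tesseract(binary_name="tesseract"):
    """Return the full path of the Tesseract executable, or None if it is not on PATH."""
    return shutil.which(binary_name)

path = find_tesseract()
if path is None:
    print("Tesseract was not found on PATH; install it or add it to PATH.")
else:
    print(f"Tesseract found at: {path}")
```

If Tesseract is installed in a location that is not on PATH (common on Windows), you can point pytesseract at it by setting pytesseract.pytesseract.tesseract_cmd to the full path of the executable.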
2. Extracting Images from PDF
Use PyMuPDF to extract images from a PDF file. Here’s how you can do it:
import fitz  # PyMuPDF

def extract_images_from_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    images = []
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        # Each entry describes one image referenced by this page
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]  # cross-reference number of the image object
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]  # raw bytes in the image's stored format
            images.append(image_bytes)
    pdf_document.close()
    return images

pdf_path = "example.pdf"
images = extract_images_from_pdf(pdf_path)
This code opens a PDF file and extracts the images embedded in each page, returning a list with the raw bytes of each image. Pages that contain no embedded images contribute nothing to the list.
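Extracting embedded images yields nothing for pages that have no image objects, and some scanners store one page as several image strips. One alternative, shown here only as a sketch, is to render each whole page to a PNG at a chosen resolution and run OCR on that instead. The function name render_pages_to_png and the 300 DPI default are assumptions for illustration; the scale factor relies on PDF pages being measured in points (72 per inch).

```python
def dpi_to_zoom(dpi):
    """Convert a target DPI to a zoom factor (PDF user space is 72 points per inch)."""
    return dpi / 72.0

def render_pages_to_png(pdf_path, dpi=300):
    """Render every page of the PDF to PNG bytes (hypothetical helper)."""
    import fitz  # PyMuPDF; imported here so dpi_to_zoom works without it

    zoom = dpi_to_zoom(dpi)
    matrix = fitz.Matrix(zoom, zoom)  # scale both axes equally
    pages = []
    with fitz.open(pdf_path) as pdf_document:
        for page in pdf_document:
            pixmap = page.get_pixmap(matrix=matrix)
            pages.append(pixmap.tobytes("png"))
    return pages

print(dpi_to_zoom(144))  # 2.0
```

The result is a list of image bytes, the same shape that extract_images_from_pdf returns, so it can feed the same downstream processing.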
3. Applying OCR to Extract Text from Images
Once you have extracted images from the PDF, use Pytesseract to perform OCR on these images:
from PIL import Image
import pytesseract
import io

def ocr_images(images):
    text = ""
    for image_bytes in images:
        # Wrap the raw bytes in a file-like object so Pillow can decode them
        image = Image.open(io.BytesIO(image_bytes))
        text += pytesseract.image_to_string(image)
    return text

# Perform OCR on the extracted images
text = ocr_images(images)
print(text)
In this code, the bytes of each image are loaded into an image object using Pillow, and pytesseract.image_to_string() extracts the text from that image. The extracted text is then concatenated into a single string.
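pytesseract.image_to_string() also accepts a lang argument and a config string that is passed through to the Tesseract command line, which is often where OCR quality gets tuned. The sketch below assumes English text and page segmentation mode 6 (a single uniform block of text); the helper names build_tesseract_config and ocr_image_bytes are hypothetical.

```python
import io

def build_tesseract_config(psm=6, oem=3):
    """Build a Tesseract option string: --oem sets the engine mode, --psm the page segmentation mode."""
    return f"--oem {oem} --psm {psm}"

def ocr_image_bytes(image_bytes, lang="eng", psm=6):
    """OCR one image given as raw bytes (hypothetical helper)."""
    # Imported lazily so the config helper can be used without these packages
    from PIL import Image
    import pytesseract

    image = Image.open(io.BytesIO(image_bytes))
    config = build_tesseract_config(psm=psm)
    return pytesseract.image_to_string(image, lang=lang, config=config)

print(build_tesseract_config(psm=6))  # --oem 3 --psm 6
```

Languages other than English need the matching Tesseract language data installed, so treat the lang value as something to adjust for your documents.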
4. Putting It All Together
Here’s a complete example that combines image extraction and OCR:
import fitz  # PyMuPDF
from PIL import Image
import pytesseract
import io

def extract_images_from_pdf(pdf_path):
    pdf_document = fitz.open(pdf_path)
    images = []
    for page_number in range(len(pdf_document)):
        page = pdf_document.load_page(page_number)
        image_list = page.get_images(full=True)
        for img_index, img in enumerate(image_list):
            xref = img[0]  # cross-reference number of the image object
            base_image = pdf_document.extract_image(xref)
            image_bytes = base_image["image"]
            images.append(image_bytes)
    pdf_document.close()
    return images

def ocr_images(images):
    text = ""
    for image_bytes in images:
        # Decode the raw bytes with Pillow, then run Tesseract on the image
        image = Image.open(io.BytesIO(image_bytes))
        text += pytesseract.image_to_string(image)
    return text

# Path to the PDF file
pdf_path = "example.pdf"

# Extract images and perform OCR
images = extract_images_from_pdf(pdf_path)
text = ocr_images(images)
print(text)
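Printing is fine for a quick check, but the recognized text is usually saved. As a small sketch, the helper below (save_text is a made-up name) writes the text next to the PDF with a .txt extension, using UTF-8 so non-ASCII characters survive. The example uses a temporary directory and placeholder text so it can run without a real PDF.

```python
import tempfile
from pathlib import Path

def save_text(text, pdf_path):
    """Write text to a .txt file next to the PDF and return the output path."""
    out_path = Path(pdf_path).with_suffix(".txt")
    out_path.write_text(text, encoding="utf-8")
    return out_path

# Placeholder text stands in for the OCR result from the script above
with tempfile.TemporaryDirectory() as tmp:
    out_path = save_text("sample OCR output\n", Path(tmp) / "example.pdf")
    print(out_path.name)  # example.txt
```

In the complete script you would call save_text(text, pdf_path) in place of, or in addition to, print(text).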
5. Summary
To read the contents of a PDF using OCR in Python, you need to extract images from the PDF and then apply OCR to those images. This involves using libraries such as PyMuPDF for PDF processing and Pytesseract for performing OCR. The steps include extracting images, converting them to a format suitable for OCR, and then processing these images to extract text.