Logo

dev-resources.site

for different kinds of informations.

Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

Published at
12/1/2024
Categories
python
tutorial
ocr
pdf
Author
prahladyeri
Categories
4 categories in total
python
open
tutorial
open
ocr
open
pdf
open
Author
11 person written this
prahladyeri
open
Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

Extracting text from a PDF is usually straightforward when it's in English and doesn't have embedded fonts. However, once those assumptions are removed, it becomes challenging to use basic python libraries like pdfminer or pdfplumber. Last month, I was tasked with extracting text from a Gujarati-language PDF and importing data fields such as name, address, city, etc., into JSON format.

If the font is embedded in the PDF itself, simple copy-pasting won't work, and using pdfplumber will return unreadable junk text. Therefore, I had to convert each PDF page to an image and then apply OCR using the pytesseract library to "scan" the page instead of just reading it. This tutorial will show you how to do just that.

Things you will need

  • pdfplumber (Python library)
  • pdf2image (Python library)
  • pytesseract (Python library)
  • tesseract-ocr

You can install the Python libraries using pip commands as shown below. For Tesseract-OCR, download and install the software from the official site. pytesseract is just a wrapper around the tesseract software.

pip install pdfplumber
pip install pdf2image
pip install pytesseract
Enter fullscreen mode Exit fullscreen mode

Converting the PDF page to an image

The first step is to convert your PDF page to an image. This extract_text_from_pdf() function does exactly that-you pass the PDF path and the page_num (zero indexed) as parameters. Note that I'm converting the page to black and white first for clarity, this is optional.

# Extract text from a specific page of a PDF
def extract_text_from_pdf(pdf_path, page_num):
    # Use pdfplumber to open the PDF
    pdf = pdfplumber.open(pdf_path)
    print(f"extracting page {page_num}..")
    page = pdf.pages[page_num]
    images = convert_from_path(pdf_path, first_page=page_num+1, last_page=page_num+1)
    image = images[0]
    # Convert to black and white
    bw_image = convert_to_bw(image)
    # Save the B&W image for debugging (optional)
    #bw_image.save("bw_page.png")
    # Perform OCR on the B&W image
    e_text = ocr_image(bw_image)
    open('out.txt', 'w', encoding='utf-8').write(e_text)
    #print("output written to file.")
    try:
        process_text(page_num, e_text)
    except Exception as e:
        print("Error occurred:", e)
    print("done..")

# Convert image to black and white
def convert_to_bw(image):
    # Convert to grayscale
    gray = image.convert('L')
    # Apply threshold to convert to pure black and white
    bw = gray.point(lambda x: 0 if x < 128 else 255, '1')
    return bw

# Perform OCR using Tesseract on a given image
def ocr_image(image_path):
    try:
        # Perform OCR
        custom_config = r'--oem 3 --psm 6 -l guj+eng'
        text = pytesseract.image_to_string(image_path, config=custom_config)  # --psm 6 treats the image as a block of text
        return text
    except Exception as e:
        print(f"Error during OCR: {e}")
        return None

Enter fullscreen mode Exit fullscreen mode

The ocr_image() function uses pytesseract to extract text from the image through OCR. The technical parameters like --oem and --psm control how the image is processed, and the -l guj+eng parameter sets the languages to be read. Since this PDF contained occasional English text, I used guj+eng.

Processing the text

Once you've imported the text using OCR, you can parse it in the format you want. This works similarly to other PDF libraries like pdfplumber or pypdf2.

nums = ['0', '', '', '', '', '', '', '', '', '']

def process_text(page_num, e_text):
    obj = None
    last_surname = None
    last_kramank = None
    print(f"processing page {page_num}..")
    for line in e_text.splitlines():
        line = line.replace('|', '').replace('[', '').replace(']', '')
        parts = [word for word in line.split(' ') if word]
        if len(parts) == 0: continue
        new_rec = True
        for char in parts[0]:
            if char not in nums:
                new_rec = False
                break
        if len(parts) < 2: continue

        if new_rec and len(parts[0]) >= 2: # numbered line
            if len(parts) < 9: continue
            if obj: records.append(obj)
            obj = {}
            last_surname = parts[1]
            obj['kramank'] = parts[0]
            last_kramank = parts[0]
            obj['full_name'] = ' '.join(parts[1:4])
            obj['surname'] = parts[1]
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[4]
            obj['village_vatan'] = parts[5]
            obj['village_mosal'] = parts[6]
            if parts[8] == 'વર્ષ':
                idx = 7
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[7]) == 8 and parts[7][2] == '-':
                idx = 7
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 6
            obj['marital_status'] = parts[idx+1]
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif parts[0] == last_surname: # new member in existing family
            if obj: records.append(obj)
            obj = {}
            obj['kramank'] = last_kramank
            obj['surname'] = last_surname
            obj['full_name'] = ' '.join(parts[0:3])
            obj['pdf_page_num'] = page_num + 1
            obj['registered_by'] = parts[3]
            obj['village_vatan'] = parts[4]
            obj['village_mosal'] = parts[5]
            if len(parts) <= 6: continue
            if parts[7] == 'વર્ષ': # date exists
                idx = 6
                obj['dob'] = parts[idx] + ' વર્ષ'
                idx += 1
            elif len(parts[6]) == 8 and parts[6][2] == '-':
                idx = 6
                obj['dob'] = parts[idx]
            else:
                print("warning: no date")
                idx = 5
            obj['marital_status'] = parts[idx+1]
            obj['extra_fields'] = '::'.join(parts[idx+2:-2])
            obj['blood_group'] = parts[-1]
        elif obj: # continuation lines
            if ("(" in line and ")" in line) or "મો.ઃ" in line:
                obj['extra_fields'] += ' ' + '::'.join(parts[0:])
    if obj: records.append(obj)        
    jstr = json.dumps(records, indent=4)
    open("guj.json", 'w', encoding='utf-8').write(jstr)
    print(f"written page {page_num} to json..")
Enter fullscreen mode Exit fullscreen mode

Every PDF has its own nuances that must be accounted for. In this case, a new serial number (like 0૧ or 0૨) in the first field signaled a new group when the subsequent field (surname) changed.

pytesseract is a testament to the evolution and advancement in IT technology. About a decade ago, reading or parsing a PDF image using OCR in a non-English language on a modestly configured PC or laptop would have been nearly impossible. This is truly progress! Happy coding, and let me know how it goes in the comments below.

References

pdf Article's
30 articles in total
Favicon
Transforming Starlight into PDF: experience and insights
Favicon
Intelligent PDF Data Extraction and database creation
Favicon
The Struggle of Finding a Free Excel to PDF Converter: My Journey and Solution
Favicon
Guess what? You can make a game inside a PDF!
Favicon
What is Instafill.ai and why it works?
Favicon
How to Save and Open PDFs in Files App with Shortcuts: Specify Path and Filename for Better Access
Favicon
23 Free Online Tools for PDF/Image Conversion & Data Extraction
Favicon
How to Insert Signatures into PDF Documents with HTML5 and JavaScript
Favicon
Easily Manage Multiple PDFs Simultaneously Using Flutter PDF Viewer
Favicon
How to Generate Invoice PDF in Laravel?
Favicon
Using LangChain to Search Your Own PDF Documents
Favicon
Add hyperlink to any Text to another field of same PDF in Angular
Favicon
🚀 Generate Dynamic PDFs in Laravel with DomPDF
Favicon
🛠 Build a Professional CV in PDF with Markdown and Hugo
Favicon
Printer Scanners VS Mobile Scanner - Do Printers Still Have a Role?
Favicon
Merge PDFs Recursively - Python
Favicon
Replace Text in PDFs Using Python
Favicon
Top 9 PDF Generator APIs in 2024
Favicon
HTML2PDF.Lib: A melhor forma de converter HTML para PDF com .Net
Favicon
How to Sign PDFs Online for Free with BoldSign
Favicon
How to Detect and Save Documents to PDF with HTML5 and JavaScript
Favicon
uniapp 入门实战 19:将前端页面导出成pdf
Favicon
Identify and Highlight Spelling Errors in PDFs Using Flutter PDF Viewer
Favicon
Combine PDF Files with PDF API
Favicon
6 Effective Ways to Merge PDF Files Using C#
Favicon
Decoding 1D/2D Barcodes from Multi-Page PDFs Using C++ and Node.js
Favicon
How to add image to PDF in C# (Developer Tutorial)
Favicon
How to Read DataMatrix and Other 1D/2D Barcodes from PDF Files in HTML5 and JavaScript
Favicon
Ferrum Doesn’t Work on Heroku?
Favicon
Unlocking Text from Embedded-Font PDFs: A pytesseract OCR Tutorial

Featured ones: