Replace Text in PDFs Using Python

Published at
12/24/2024
Categories
pdf
pymupdf
Author
abbazs

Introduction

Manipulating PDFs can be challenging because of their complex structure, but with Python and the PyMuPDF library you can search for text, replace it, and save the modified PDF. In this tutorial, we’ll create a Python CLI tool that finds and replaces text in a PDF. The replacement is written in a default font, size, and color, which you can adjust to match the original as closely as possible.


Prerequisites

Before you begin, ensure you have the following installed:

  1. Python: Version 3.7 or above.
  2. pip: Python's package manager.
  3. PyMuPDF: A Python library for working with PDFs.

Install PyMuPDF using pip:

pip install pymupdf

Additionally, we’ll use the click library to create a user-friendly command-line interface (CLI):

pip install click

Code Walkthrough

Here’s the complete code for our CLI tool:

import click
from pathlib import Path
import fitz  # PyMuPDF

@click.command()
@click.argument("input_pdf", type=click.Path(exists=True, dir_okay=False, path_type=Path))
@click.argument("output_pdf", type=click.Path(dir_okay=False, writable=True, path_type=Path))
@click.argument("find_text", type=str)
@click.argument("replace_text", type=str)
def replace_text_in_pdf(input_pdf: Path, output_pdf: Path, find_text: str, replace_text: str):
    """
    Replace FIND_TEXT with REPLACE_TEXT in INPUT_PDF and save the result to OUTPUT_PDF.
    """
    # Open the input PDF
    doc = fitz.open(str(input_pdf))

    for page_num, page in enumerate(doc, start=1):
        # Search for occurrences of find_text
        instances = page.search_for(find_text)

        if not instances:
            click.echo(f"No occurrences of '{find_text}' found on page {page_num}.")
            continue

        click.echo(f"Found {len(instances)} occurrences on page {page_num}. Replacing...")

        # Mark every match for redaction, then apply all redactions in one pass
        for rect in instances:
            page.add_redact_annot(rect)
        page.apply_redactions()

        # Text properties for the replacement (fixed defaults, not read from the PDF)
        font = "helv"  # Built-in Helvetica
        font_size = 7.0  # Size in points
        color = (0, 0, 0)  # Black; PyMuPDF expects RGB components in the range 0 to 1

        for rect in instances:
            # insert_text expects the start of the text baseline, which sits a
            # little above the bottom edge of the match; adjust 2.2 as needed
            baseline = fitz.Point(rect.x0, rect.y1 - 2.2)

            # Insert the new text at the adjusted position
            page.insert_text(
                baseline,
                replace_text,
                fontsize=font_size,
                fontname=font,
                color=color,
            )
            click.echo(f"Replaced '{find_text}' with '{replace_text}' on page {page_num}.")

    # Save the modified PDF
    doc.save(str(output_pdf))
    click.echo(f"Modified PDF saved to {output_pdf}.")

if __name__ == "__main__":
    replace_text_in_pdf()

How It Works

  1. Searching for Text:
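    The page.search_for(find_text) method identifies all occurrences of the specified text on a page and returns their bounding rectangles. For ASCII letters the search is case-insensitive, so searching for Python also matches python.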

  2. Redacting Original Text:
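    The page.add_redact_annot(rect) method marks an area for redaction, and page.apply_redactions() then removes the text in the marked areas from the page content instead of merely covering it. By default the redacted area is filled with white, which can be visible on a colored background.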

  3. Inserting Replacement Text:
    Using page.insert_text(), we add the replacement text at the same location as the original, in the default font, size, and color set in the code.

  4. Saving the PDF:
    Finally, the modified document is saved to the specified output file.
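
As a quick sanity check after the four steps above, one could reopen the saved file and count how many matches of the original text remain. The helper below is a minimal sketch and not part of the original tool: count_occurrences is a hypothetical name, and it only assumes that each page offers a search_for method returning a list of matches, as PyMuPDF pages do.

```python
def count_occurrences(pages, text):
    """Return the total number of matches of `text` across `pages`.

    `pages` is any iterable of objects with a `search_for(text)` method
    that returns a list of match rectangles. Iterating over a PyMuPDF
    document yields such page objects.
    """
    return sum(len(page.search_for(text)) for page in pages)
```

With PyMuPDF this might be called as count_occurrences(fitz.open("output.pdf"), "Python"). A result of 0 suggests every match was redacted, unless the replacement text itself contains the search text.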


Running the Tool

Save the code to a file, e.g., replace_text_pdf.py. Then, run it from the terminal as follows:

python replace_text_pdf.py input.pdf output.pdf "find_text" "replace_text"

Example

Suppose you have a PDF named example.pdf with the word Python in it, and you want to replace it with PyMuPDF. Run:

python replace_text_pdf.py example.pdf modified_example.pdf "Python" "PyMuPDF"

Important Notes

  1. Font and Style:

    • The tool always writes the replacement text in Helvetica (helv), 7 pt, black.
    • You can customize the font and style by extracting properties from the PDF, though it’s not guaranteed to perfectly match due to PDF limitations.
  2. PDF Structure:

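    • PDFs are not inherently designed for text editing. This tool works best with text-based PDFs. Scanned pages contain images rather than searchable text, and text set in an embedded font may look different after replacement because the tool substitutes Helvetica.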
  3. Testing:

    • Always back up your original PDF before using this tool.
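
To illustrate the first note, here is a hedged sketch of how the size and color of the original text might be read before it is redacted. The function names srgb_int_to_rgb and style_for_rect are hypothetical and not part of the tool above. The sketch assumes that page.get_text("dict", clip=rect) returns PyMuPDF's nested dictionary of blocks, lines, and spans, where each span carries "font", "size", and "color" (a packed sRGB integer).

```python
def srgb_int_to_rgb(value):
    """Convert a packed sRGB integer (0xRRGGBB) to RGB floats in the range 0 to 1."""
    red = (value >> 16) & 0xFF
    green = (value >> 8) & 0xFF
    blue = value & 0xFF
    return (red / 255, green / 255, blue / 255)


def style_for_rect(page, rect, default_size=7.0):
    """Return (font_name, font_size, color) of the first text span inside `rect`.

    Assumes `page` behaves like a PyMuPDF page. Falls back to black text
    at `default_size` (and no font name) when no span is found.
    """
    data = page.get_text("dict", clip=rect)
    for block in data.get("blocks", []):
        for line in block.get("lines", []):  # image blocks carry no "lines"
            for span in line.get("spans", []):
                color = srgb_int_to_rgb(span["color"])
                return span["font"], span["size"], color
    return None, default_size, (0.0, 0.0, 0.0)
```

The lookup has to run before apply_redactions(), since the spans are gone afterwards. The returned size and color could then be passed to insert_text as fontsize and color. The returned font name is the name stored in the PDF, which generally cannot be passed as fontname directly, so keeping helv or loading a matching font file remains a separate decision.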

Conclusion

With Python and PyMuPDF, replacing text in a PDF is straightforward and powerful. This tutorial covered a CLI tool that can be extended further to suit specific needs. Try it out, and let us know how it works for you in the comments!

Happy coding! 🚀
