Masking confidential data in prompts using Regex and spaCy

Published at: 12/2/2024
Categories: promptengineering, python, regex
Author: aditykris

People have privacy concerns about popular LLMs such as those from OpenAI, Gemini, Claude, etc. We don't really know what happens behind the scenes unless it's an open-source model, so we have to be careful on our side.

The first thing is how we handle the information that we pass to the LLMs. Experts recommend not including confidential information or personal identifiers in prompts. That sounds easy, but as the context sizes of LLMs increase, we can pass large texts to the models, so it can become hard to review and mask all the identifiers by hand.

So, I tried to create a Python script that detects and masks identifiers and confidential information. Regex is magical, and I used it to recognize different kinds of confidential information and replace them with masks. I also used the spaCy library to detect common identifiers such as names, places, etc.

Note: Right now, this is tailored to the Indian context, but common identifiers can still be detected.

So let's look at the implementation (I took the help of an LLM for the implementation).
If you want to skip the explanation, here's the link to the code base: aditykris/prompt-masker-Indian-context

Importing the necessary modules/libraries

import re 

from typing import Dict, List, Tuple

import spacy

nlp = spacy.load("en_core_web_sm")

You have to install "en_core_web_sm" manually using the command below:

python -m spacy download en_core_web_sm

Defining regex patterns for common Indian confidential information.

class IndianIdentifier:
    '''Regex for common Indian identifiers'''
    PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}'
    AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}'
    INDIAN_PASSPORT = r'[A-PR-WYa-pr-wy][1-9]\d\s?\d{4}[1-9]'
    DRIVING_LICENSE = r'(([A-Z]{2}[0-9]{2})( )|([A-Z]{2}-[0-9]{2}))((19|20)[0-9][0-9])[0-9]{7}'
    UPI_ID = r'[\.\-a-z0-9]+@[a-z]+'
    INDIAN_BANK_ACCOUNT = r'\d{9,18}'
    IFSC_CODE = r'[A-Z]{4}0[A-Z0-9]{6}'
    INDIAN_PHONE_NUMBER = r'(\+91|\+91\-|0)?[789]\d{9}'
    EMAIL = r'[\w\.-]+@[\w\.-]+\.\w+'

    @classmethod
    def get_all_patterns(cls) -> Dict[str, str]:
        """Returns all regex patterns defined in the class"""
        return {
            name: pattern 
            for name, pattern in vars(cls).items() 
            if isinstance(pattern, str) and not name.startswith('_')
        }
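Some of these patterns are broad enough to match inside other identifiers. For instance, \d{9,18} also matches the digits of a phone number, which is why the sample run at the end lists 919876543210 as a bank account. Here is a minimal sketch of one way to tighten such a pattern with lookarounds. STRICT_BANK_ACCOUNT is a hypothetical name and is not part of the original code base.

```python
import re

# Pattern copied from the IndianIdentifier class above.
INDIAN_BANK_ACCOUNT = r'\d{9,18}'

# Hypothetical stricter variant (an assumption, not from the original code):
# the lookbehind rejects a digit run preceded by '+' or another digit,
# and the lookahead rejects a run that is followed by more digits.
STRICT_BANK_ACCOUNT = r'(?<![\d+])\d{9,18}(?!\d)'

sample = "Phone: +919876543210, account: 123456789012"

loose_matches = re.findall(INDIAN_BANK_ACCOUNT, sample)
strict_matches = re.findall(STRICT_BANK_ACCOUNT, sample)

print(loose_matches)   # ['919876543210', '123456789012']
print(strict_matches)  # ['123456789012']
```

This only helps when the phone number carries the + prefix. A number written as 09876543210 would still look like an account number to both patterns.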

I was revising Python classes and methods, so I went on to use them here.
I found the regexes for these identifiers on DebugPointer, which was very helpful.
Now to the detection function. A simple re.finditer() is used to find every match of one pattern in the text, and the caller loops through the different patterns. The matches are stored in a list.

def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]:
    """
    Find all matches of a pattern in text and return their positions and matched text
    """
    matches = []
    for match in re.finditer(pattern, text):
        matches.append((match.start(), match.end(), match.group()))
    return matches
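As a quick usage sketch, here is what the helper returns for the PAN pattern. The function is repeated in compact form so that the example runs on its own, and the sample string is made up for illustration.

```python
import re
from typing import List, Tuple


def find_matches(text: str, pattern: str) -> List[Tuple[int, int, str]]:
    """Same helper as above, repeated so the example runs on its own."""
    return [(m.start(), m.end(), m.group()) for m in re.finditer(pattern, text)]


PAN = r'[A-Z]{5}[0-9]{4}[A-Z]{1}'
text = "PAN ABCDE1234F"

matches = find_matches(text, PAN)
print(matches)  # [(4, 14, 'ABCDE1234F')]

# The offsets slice the original string back to the matched text.
start, end, value = matches[0]
print(text[start:end] == value)  # True
```

The (start, end) offsets are what the masking step relies on later, so it is worth checking that they slice back to the matched value.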

I used a simple dictionary to store the replacement texts and wrapped it in a function that returns the replacement text for a given identifier type.

def get_replacement_text(identifier_type: str) -> str:
    """
    Returns appropriate replacement text based on the type of identifier
    """
    replacements = {
        'PAN': '[PAN_NUMBER]',
        'AADHAR': '[AADHAR_NUMBER]',
        'INDIAN_PASSPORT': '[PASSPORT_NUMBER]',
        'DRIVING_LICENSE': '[DL_NUMBER]',
        'UPI_ID': '[UPI_ID]',
        'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]',
        'IFSC_CODE': '[IFSC_CODE]',
        'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]',
        'EMAIL': '[EMAIL_ADDRESS]',
        'PERSON': '[PERSON_NAME]',
        'ORG': '[ORGANIZATION]',
        'GPE': '[LOCATION]'
    }
    return replacements.get(identifier_type, '[MASKED]')
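A short sketch of how the fallback behaves, using a shortened mapping. Keeping the dictionary in a module-level constant (REPLACEMENTS, a name I made up for this example) is an assumption. It just avoids rebuilding the dictionary on every call.

```python
# Shortened mapping for illustration; the full one is in the function above.
REPLACEMENTS = {
    'PAN': '[PAN_NUMBER]',
    'PERSON': '[PERSON_NAME]',
}


def get_replacement_text(identifier_type: str) -> str:
    # dict.get returns the second argument when the key is missing.
    return REPLACEMENTS.get(identifier_type, '[MASKED]')


print(get_replacement_text('PAN'))   # [PAN_NUMBER]
print(get_replacement_text('DATE'))  # [MASKED]
```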

Ah! The main part begins.

def analyze_identifiers(text: str) -> Tuple[str, Dict[str, List[str]]]:
    """
    Function to identify and hide sensitive information.
    Returns:
        - masked_text: Text with all sensitive information masked
        - found_identifiers: Dictionary containing all identified sensitive information
    """
    # Initialize variables
    masked_text = text
    found_identifiers = {}
    positions_to_mask = []

    # First, find all regex matches
    for identifier_name, pattern in IndianIdentifier.get_all_patterns().items():
        matches = find_matches(text, pattern)
        if matches:
            found_identifiers[identifier_name] = [match[2] for match in matches]
            positions_to_mask.extend(
                (start, end, identifier_name) for start, end, _ in matches
            )

    # Then, process named entities using spaCy
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            positions_to_mask.append((ent.start_char, ent.end_char, ent.label_))
            if ent.label_ not in found_identifiers:
                found_identifiers[ent.label_] = []
            found_identifiers[ent.label_].append(ent.text)

    # Sort positions by start index in reverse order to handle overlapping matches
    positions_to_mask.sort(key=lambda x: x[0], reverse=True)

    # Apply masking
    for start, end, identifier_type in positions_to_mask:
        replacement = get_replacement_text(identifier_type)
        masked_text = masked_text[:start] + replacement + masked_text[end:]

    return masked_text, found_identifiers

This function takes the prompt as input and returns the masked prompt along with the identified elements as a dictionary.

Let me explain it step by step.

The following loop goes through the regexes of the different identifiers to find matches in the prompt. If any are found, it will:
 1. Store the identified information in a dictionary, with the identifier type as the key, to keep track of it.
 2. Note the positions and store them in positions_to_mask so that we can apply the masking later.

    for identifier_name, pattern in IndianIdentifier.get_all_patterns().items():
        matches = find_matches(text, pattern)
        if matches:
            found_identifiers[identifier_name] = [match[2] for match in matches]
            positions_to_mask.extend(
                (start, end, identifier_name) for start, end, _ in matches
            )

Now it's spaCy time. It's a great library for natural language processing (NLP) tasks. We can extract identifiers from the text using the loaded nlp pipeline.
Currently, I use it to detect names, organizations, and locations.
This works the same way as the loop above: it records the position of each entity and stores the entity text under its label.

    # Then, process named entities using spaCy
    doc = nlp(text)
    for ent in doc.ents:
        if ent.label_ in ["PERSON", "ORG", "GPE"]:
            positions_to_mask.append((ent.start_char, ent.end_char, ent.label_))
            if ent.label_ not in found_identifiers:
                found_identifiers[ent.label_] = []
            found_identifiers[ent.label_].append(ent.text)
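The grouping part of this loop can also be written with collections.defaultdict, which removes the "if label not in found_identifiers" check. Below is a minimal sketch that does not need spaCy to run: the entities list is a hypothetical stand-in for doc.ents, written by hand for this example and not real model output.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

text = "Mr. John Doe lives in Mumbai and works at TechCorp since 2020."

# Hypothetical stand-in for doc.ents: (text, label, start_char, end_char).
entities = [
    ("John Doe", "PERSON", 4, 12),
    ("Mumbai", "GPE", 22, 28),
    ("TechCorp", "ORG", 42, 50),
    ("2020", "DATE", 57, 61),
]

WANTED_LABELS = {"PERSON", "ORG", "GPE"}

found: Dict[str, List[str]] = defaultdict(list)
positions: List[Tuple[int, int, str]] = []

for ent_text, label, start, end in entities:
    if label in WANTED_LABELS:
        positions.append((start, end, label))
        found[label].append(ent_text)  # the list is created on first use

print(dict(found))
```

The DATE entity is skipped because it is not in WANTED_LABELS, the same way the original loop ignores every label other than PERSON, ORG and GPE.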

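In some test cases, I noticed that some masks came out wrong. Each replacement changes the length of the text, so the positions recorded for the matches after it no longer line up. Sorting the positions in reverse order and replacing from the end of the text keeps the earlier offsets valid. Matches that overlap each other are still replaced one after another, so they can leave fragments behind, as the sample output at the end shows.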

 

    # Sort positions by start index in reverse order to handle overlapping matches
    positions_to_mask.sort(key=lambda x: x[0], reverse=True) 
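To see why the order matters, here is a small sketch that masks two non-overlapping identifiers both ways. The mask helper and the sample string are made up for this example and are not part of the code base.

```python
from typing import List, Tuple

text = "PAN ABCDE1234F, IFSC SBIN0123456"
# (start, end, replacement) for the two identifiers in the string above.
spans = [(4, 14, "[PAN_NUMBER]"), (21, 32, "[IFSC_CODE]")]


def mask(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Apply the replacements in the order given."""
    for start, end, replacement in spans:
        text = text[:start] + replacement + text[end:]
    return text


left_to_right = mask(text, sorted(spans, key=lambda s: s[0]))
right_to_left = mask(text, sorted(spans, key=lambda s: s[0], reverse=True))

print(left_to_right)  # PAN [PAN_NUMBER], IFS[IFSC_CODE]56
print(right_to_left)  # PAN [PAN_NUMBER], IFSC [IFSC_CODE]
```

Going left to right, the first mask is two characters longer than the PAN it replaces, so the offsets of the IFSC code are off by two. Going right to left, nothing before the current span has changed yet.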

Finally, the masking happens using the positions stored in positions_to_mask.

    # Apply masking
    for start, end, identifier_type in positions_to_mask:
        replacement = get_replacement_text(identifier_type)
        masked_text = masked_text[:start] + replacement + masked_text[end:]

    return masked_text, found_identifiers
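Right-to-left replacement keeps the offsets valid, but when two detected spans overlap, the second replacement cuts into the mask written by the first one. That is where fragments such as [PERSON_NAME]R] in the sample output come from. One possible fix, sketched here under my own assumptions, is to drop overlapping spans before masking and keep the longest one. drop_overlaps and apply_masks are hypothetical helpers and are not in the original code base.

```python
from typing import Dict, List, Tuple

Span = Tuple[int, int, str]  # (start, end, identifier_type)


def drop_overlaps(spans: List[Span]) -> List[Span]:
    """Keep the longest span among overlapping ones.

    Ties keep the span that appears first in the input list.
    The result is sorted by start index in reverse order.
    """
    # sorted() is stable, so equal keys keep their input order.
    ordered = sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0]))
    kept: List[Span] = []
    for start, end, label in ordered:
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    return sorted(kept, key=lambda s: s[0], reverse=True)


def apply_masks(text: str, spans: List[Span], replacements: Dict[str, str]) -> str:
    """Mask the non-overlapping spans from right to left."""
    for start, end, label in drop_overlaps(spans):
        text = text[:start] + replacements.get(label, '[MASKED]') + text[end:]
    return text


text = "Call +919876543210 now"
spans = [
    (5, 18, 'INDIAN_PHONE_NUMBER'),  # '+919876543210'
    (6, 18, 'INDIAN_BANK_ACCOUNT'),  # '919876543210', inside the phone number
]
replacements = {
    'INDIAN_PHONE_NUMBER': '[PHONE_NUMBER]',
    'INDIAN_BANK_ACCOUNT': '[BANK_ACCOUNT]',
}

print(apply_masks(text, spans, replacements))  # Call [PHONE_NUMBER] now
```

Because ties keep the first span in the list and analyze_identifiers adds the regex matches before the spaCy entities, a regex match such as PAN would win over a spaCy PERSON entity covering the same characters. Whether "longest wins" is the right rule depends on the data, so treat it as a starting point.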

A sample run of the program looks like this:

Input:

Mr. John Doe's PAN number is ABCDE1234F and Aadhar is 1234 5678 9012.
He lives in Mumbai and works at TechCorp.
His phone number is +919876543210 and email is [email protected].
Bank account: 123456789012 with IFSC: SBIN0123456

Output:
Masked Text:

Mr. [PERSON_NAME]'s [ORGANIZATION] number is [PERSON_NAME]R] and [LOCATION] is 1234 5678 9012.
He lives in [LOCATION] and works at [ORGANIZATION].
His phone number is [PHONE_NUMBER]T] and email is [EMAIL_ADDRESS]IZATION] account: [BANK_ACCOUNT] with [ORGANIZATION]: [IFSC_CODE]

Identified sensitive information:
PAN: ['ABCDE1234F']
UPI_ID: ['john.doe@example']
INDIAN_BANK_ACCOUNT: ['919876543210', '123456789012']
IFSC_CODE: ['SBIN0123456']
INDIAN_PHONE_NUMBER: ['+919876543210']
EMAIL: ['[email protected]']
PERSON: ['John Doe', 'ABCDE1234F']
ORG: ['PAN', 'TechCorp', 'Bank', 'IFSC']
GPE: ['Aadhar', 'Mumbai']
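In the output above, the Aadhar number 1234 5678 9012 stays unmasked and AADHAR does not appear in the identified list. The AADHAR pattern requires the first digit to be between 2 and 9, so a number starting with 1 is not matched. A small check, with a second sample number made up for illustration:

```python
import re

# Pattern copied from the IndianIdentifier class above.
AADHAR = r'[2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4}'

not_matched = re.search(AADHAR, "Aadhar is 1234 5678 9012")
matched = re.search(AADHAR, "Aadhar is 2345 6789 0123")

print(not_matched)      # None
print(matched.group())  # 2345 6789 0123
```

The other oddities in the output, such as PAN and IFSC being tagged as ORG, come from the spaCy model's entity predictions on this short text.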