Unlocking Rapid Data Extraction: Groq + OCR and Claude Vision

Published at
5/22/2024
Categories
claude
webdev
python
productivity
Author
tarek_eissa
Introduction

In this article, we explore several methods for extracting data from documents. We compare OCR+LLM with Claude 3 Vision, look at fast OCR transformers and cloud-native OCR services, provide a code example for exposing OCR as a simple API using docTR, and discuss how Groq can be used to get the best inference speed for LLMs.

The Use Case: Document Scanning to Save Time

Imagine a SaaS platform that helps register invoices for a company. Speed and convenience are paramount, and while some errors are tolerable, the goal is to minimize them. This scenario highlights the need for rapid and reliable data extraction.

Claude 3 Vision vs OCR+LLM

Claude 3 Vision

Claude 3 Vision is known for its speed and cost-efficiency. However, it has limitations, including a tendency to hallucinate, that is, to return values that do not appear in the document. It is suitable for simple tasks but may fall short in more complex scenarios.

OCR+LLM

OCR+LLM combines Optical Character Recognition (OCR) with Large Language Models (LLMs) to extract and analyze text. This approach offers a balance between accuracy and speed, making it ideal for more detailed data extraction tasks.

Testing Limits with Claude 3 Vision

Using an example invoice, we can define a protocol for our application:

invoice_number: "string"
invoice_date: "string"  # YYYY-MM-DD
due_date: "string"  # YYYY-MM-DD
seller_details:
  seller_name: "string"
  seller_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
buyer_details:
  buyer_name: "string"
  buyer_address:
    street_number_and_name: "string"
    city_or_town: "string"
    country: "string"
  buyer_email: "string"
  buyer_phone_number: "string"
products_services:
  - item_number: number
    description: "string"
    quantity: number
    unit_price: number
    total_price: number
sub_total: number
total: number

This pseudo-YAML format outlines the fields we want to extract from an invoice. Testing with Claude 3 Vision yielded response times of about 1 second, which is slower than desired.
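Whichever extractor produces the JSON, it can help to check the result against the protocol before storing it. The sketch below is a minimal, standard-library illustration and not part of the original application: the function name validate_invoice and the specific checks are assumptions, and only the top-level fields are verified.

```python
from datetime import datetime

# Top-level fields from the protocol above, mapped to the Python type expected
# after json.loads(). The choice of checks is an illustrative assumption.
REQUIRED_FIELDS = {
    "invoice_number": str,
    "invoice_date": str,
    "due_date": str,
    "seller_details": dict,
    "buyer_details": dict,
    "products_services": list,
    "sub_total": (int, float),
    "total": (int, float),
}
DATE_FIELDS = ("invoice_date", "due_date")


def validate_invoice(data: dict) -> list[str]:
    """Return a list of problems found; an empty list means the payload passed."""
    problems = []
    for name, expected in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        # bool is a subclass of int in Python, so reject it explicitly
        elif isinstance(data[name], bool) or not isinstance(data[name], expected):
            problems.append(f"wrong type for {name}: {type(data[name]).__name__}")
    for name in DATE_FIELDS:
        value = data.get(name)
        if isinstance(value, str):
            try:
                datetime.strptime(value, "%Y-%m-%d")
            except ValueError:
                problems.append(f"{name} is not YYYY-MM-DD: {value!r}")
    return problems
```

A non-empty result could be logged or used to trigger a second extraction attempt, depending on how many errors the product can tolerate.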

OCR Transformers Designed for Speed

Notable OCR Tools

  1. docTR: Optimized for high-speed performance on both CPU and GPU, requiring only three lines of code to implement.
  2. TrOCR: Pre-trained transformers supported by Microsoft, offering various models.
  3. PaddleOCR: Known for its speed, capable of processing large volumes of images in real-time.
  4. MMOCR: Another fast OCR tool.
  5. Surya: Highly efficient and fast.

Performance testing showed that these OCR tools could achieve processing times as low as 20ms on a GPU.
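Latency figures like these depend heavily on hardware, image size, and warm-up, so it is worth measuring them on your own setup. The following is a small timing harness using only the standard library; the name measure_latency_ms and its parameters are illustrative assumptions, not the benchmark behind the numbers above.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], runs: int = 20, warmup: int = 3) -> dict:
    """Call fn repeatedly and report latency statistics in milliseconds."""
    for _ in range(warmup):
        fn()  # warm-up calls are not timed (model loading, caches, GPU kernels)
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    samples.sort()
    # Approximate 95th percentile by index into the sorted samples
    p95_index = min(len(samples) - 1, int(0.95 * len(samples)))
    return {
        "median_ms": statistics.median(samples),
        "p95_ms": samples[p95_index],
        "max_ms": samples[-1],
    }
```

With an OCR model and a loaded document in scope, a call such as measure_latency_ms(lambda: model(doc)) would time one inference per run.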

Cloud-Native OCRs

  1. Azure Form Recognizer: Best performance time around 3 seconds.
  2. Amazon Textract: Processes documents in 3-4 seconds per page.
  3. Google Cloud Vision API and Document AI: Highly efficient and similar to Azure and Amazon.
  4. ABBYY Cloud OCR: Faster than the other alternatives and offers detailed page representations.

These cloud AI services used to be the go-to solutions but are now often replaced by LLMs due to cost and flexibility advantages.

(https://miro.medium.com/v2/resize:fit:720/format:webp/1*xlWNvAtaM0ObnSKYz6MWwA.png)

Implementing OCR with docTR

Setting Up the OCR API

Here’s a simple example using docTR:

from fastapi import FastAPI, File, UploadFile
from fastapi.responses import JSONResponse
from doctr.io import DocumentFile
from doctr.models import ocr_predictor

app = FastAPI(title="OCR Service using docTR")

# Load the pretrained model once at startup instead of on every request
model = ocr_predictor(pretrained=True)

@app.post("/ocr/")
async def perform_ocr(file: UploadFile = File(...)):
    # Read the uploaded image bytes and wrap them as a docTR document
    image_data = await file.read()
    doc = DocumentFile.from_images(image_data)
    result = model(doc)

    # Flatten the page -> block -> line -> word hierarchy into one string per line
    extracted_texts = []
    for page in result.pages:
        for block in page.blocks:
            for line in block.lines:
                line_text = ' '.join([word.value for word in line.words])
                extracted_texts.append(line_text)

    return JSONResponse(content={"ExtractedText": extracted_texts})

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8001)

Docker Setup for the OCR API

# Use an official Python runtime as a parent image, suitable for TensorFlow
FROM tensorflow/tensorflow:latest

# Set the working directory in the container
WORKDIR /app

# Install system dependencies required for OpenCV and WeasyPrint
RUN apt-get update && apt-get install -y \
    libgl1-mesa-glx \
    libpango-1.0-0 \
    libpangocairo-1.0-0 \
    libgdk-pixbuf2.0-0 \
    libffi-dev \
    shared-mime-info

# Install FastAPI and Uvicorn
RUN pip install fastapi uvicorn python-multipart aiofiles Pillow

# Copy the local directory contents into the container
COPY . /app

# Install `doctr` with TensorFlow support
RUN pip install python-doctr[tf]

# Expose the port FastAPI will run on
EXPOSE 8001

# Command to run the FastAPI server on container start
CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8001", "--workers", "4"]

(https://miro.medium.com/v2/resize:fit:720/format:webp/0*OeNLrlTN_iNgWcvX)

Groq: The King of Speed

Groq's architecture provides exceptional performance and cost efficiency, boasting speeds three times faster at half the cost compared to traditional methods.

Using Groq with OCR

import os

from groq import Groq

# Read the API key from the environment rather than hard-coding it
API_KEY_GROQ = os.environ.get("GROQ_API_KEY")

def send_request_to_groq(content: str) -> str:
    client = Groq(api_key=API_KEY_GROQ)
    completion = client.chat.completions.create(
        model="gemma-7b-it",
        messages=[
            {
                "role": "system",
                "content": "You are an API server that receives content from a document and returns a JSON with the defined protocol"
            },
            {
                "role": "user",
                "content": content
            }
        ],
        temperature=1,
        max_tokens=1024,
        top_p=1,
        stream=False,
        # JSON mode: ask the model to reply with a single JSON object
        response_format={"type": "json_object"},
        stop=None,
    )

    return completion.choices[0].message.content

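The response_format parameter is particularly useful here: setting it to {"type": "json_object"} turns on JSON mode, which asks the model to return a syntactically valid JSON object, so the reply can usually be handed straight to a JSON parser.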
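Even with response_format set to json_object, a reply can occasionally fail to parse or come back as something other than an object. One way to handle this is a small retry wrapper. The sketch below is an assumption, not part of the original code: request_json is a hypothetical helper that accepts any callable taking a prompt and returning a string, so send_request_to_groq can be passed in directly.

```python
import json
from typing import Callable


def request_json(call_llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> dict:
    """Call the LLM until it returns a JSON object, up to max_attempts times."""
    last_error = None
    for _ in range(max_attempts):
        raw = call_llm(prompt)
        try:
            parsed = json.loads(raw)
        except json.JSONDecodeError as exc:
            last_error = exc  # not valid JSON, try again
            continue
        if isinstance(parsed, dict):
            return parsed
        # Valid JSON, but not an object (for example a list or a bare string)
        last_error = ValueError(f"expected a JSON object, got {type(parsed).__name__}")
    raise RuntimeError(f"no valid JSON object after {max_attempts} attempts") from last_error
```

Usage would look like request_json(send_request_to_groq, extracted_text). Each retry costs another round trip, so max_attempts is a trade-off between latency and reliability.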

Final Implementation

Controller Code

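import json
import os
import shutil
import tempfile
import time

from fastapi import File, Form, UploadFile

# Not shown in this article: the helpers convert_pdf_to_images,
# extract_text_with_pytesseract and remove_json_format, and the prompt
# constants systemMessage and jsonContentStarter.

@app.post("/extract_fast")
async def extract_text(file: UploadFile = File(...), extraction_contract: str = Form(...)):
    # Save the upload to disk; leaving the with-block flushes and closes the file
    with tempfile.NamedTemporaryFile(delete=False) as temp_file:
        shutil.copyfileobj(file.file, temp_file)
        file_path = temp_file.name

    try:
        images = convert_pdf_to_images(file_path)
        extracted_text = extract_text_with_pytesseract(images)
    finally:
        os.remove(file_path)  # delete=False means we must clean up ourselves

    # Assemble the prompt: instructions, OCR text, target structure, JSON starter
    extracted_text = "\n new page --- \n".join(extracted_text)
    extracted_text = systemMessage + "\n####Content\n\n" + extracted_text
    extracted_text = extracted_text + "\n####Structure of the JSON output file\n\n" + extraction_contract
    extracted_text = extracted_text + "\n#### JSON Response\n\n" + jsonContentStarter

    start_time = time.time()
    content = send_request_to_groq(extracted_text)
    elapsed_time = time.time() - start_time
    print(f"send_request_to_groq took {elapsed_time:.2f} seconds")

    content = remove_json_format(content)

    return json.loads(content)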
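The controller calls remove_json_format, whose body is not shown in the article. A plausible reading is that it strips the Markdown code fence that some models wrap around JSON output. The version below is a hypothetical stand-in written under that assumption, not the author's implementation; parse_reply is likewise an illustrative helper.

```python
import json

FENCE = "`" * 3  # the three-backtick Markdown code fence marker


def remove_json_format(text: str) -> str:
    """Strip a surrounding Markdown code fence from an LLM reply, if present."""
    cleaned = text.strip()
    if cleaned.startswith(FENCE):
        first_line, newline, rest = cleaned.partition("\n")
        # With a newline, drop the whole opening line (fence plus optional "json" tag);
        # without one, drop only the three backticks.
        cleaned = rest if newline else first_line[len(FENCE):]
        cleaned = cleaned.strip()
        if cleaned.endswith(FENCE):
            cleaned = cleaned[: -len(FENCE)]
    return cleaned.strip()


def parse_reply(text: str) -> dict:
    """Clean the reply and parse it; raises json.JSONDecodeError on malformed JSON."""
    return json.loads(remove_json_format(text))
```

Replies that are already bare JSON pass through unchanged apart from whitespace trimming, so the helper is safe to apply whether or not the model adds a fence.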

Conclusion

For document scanning and data extraction, combining GPU-accelerated OCR with LLM inference on Groq provides superior speed and efficiency. This approach is especially beneficial for processing invoices and other documents captured via mobile devices.
