DEMO - Voice to PDF - Complete PDF documents with voice commands using the Claude 3 Opus API

Published at: 4/27/2024
Categories: claude, promptengineering, speechtotext, pdf
Author: juanstoppa

I spent some time last weekend exploring the Claude 3 Opus API from Anthropic. I had heard many comments about its potential, which appears to surpass ChatGPT, especially when tasked with complex problems such as writing code.

While looking into its capabilities, I decided to build an app that lets you complete a PDF form with voice commands. It ended up working much better than I expected.

The idea of the app was to:

  1. Upload a PDF form.
  2. Record voice commands to fill in the form.
  3. Download the completed PDF form.

You can see the demo below:

You can find the app on GitHub at https://github.com/jstoppa/voice_to_pdf

How did I build it?

I spent more time getting the functionality working than writing the prompt :-). I used plain JavaScript and Node.js because I like to keep these demos as plain as possible, so I can later pick up the code and use it in other frameworks without relying on framework-specific nuances.

The Frontend app:

It runs using Parcel, which is very simple and easy to set up. The app has 4 files:

  • app.js: the main app that brings the entire solution together.
  • dragDrop.js: basic primitives to handle the drag-and-drop file functionality.
  • pdfHandler.js: contains the logic for two fundamental actions:
    • readPdf: reads the dropped file and displays it on the screen. It uses PDF.js to load the file, get all the fields and display the document in the browser.
    • writePdf: writes the final completed PDF after receiving the response from the Claude 3 Opus API. It uses the pdf-lib library to modify the final file.
  • speechRecognition.js: uses the little-known Web Speech API that comes with the browser and works impressively well. The file handles listening to the voice and displaying the text on the screen.
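
As a rough illustration of the listening step, here is a minimal sketch of a Web Speech API wrapper. It is not the code from the repository: createListener, its parameters and the 'en-US' language are assumptions made for the example. The recognition constructor is passed in as an argument so the function can also be exercised outside a browser.

```javascript
// Sketch only: `createListener` and its parameters are hypothetical names,
// not taken from the voice_to_pdf repository.
function createListener(
  onText,
  Recognition = globalThis.SpeechRecognition || globalThis.webkitSpeechRecognition
) {
  if (!Recognition) {
    throw new Error('Web Speech API is not available in this environment');
  }

  const recognition = new Recognition();
  recognition.continuous = true;     // keep listening across pauses
  recognition.interimResults = true; // report partial transcripts while speaking
  recognition.lang = 'en-US';        // assumed language for the example

  recognition.onresult = (event) => {
    // event.results accumulates every result of the session; each result
    // holds one or more alternatives, and index 0 is the most likely one.
    const transcript = Array.from(event.results)
      .map((result) => result[0].transcript)
      .join('');
    onText(transcript);
  };

  return recognition;
}
```

In a browser, something like `createListener((text) => { output.textContent = text; }).start()` would begin listening, where output is a hypothetical element on the page.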

The Backend app:

It's nothing impressive, just a single Node.js endpoint (server.js) that proxies the API call to the Claude Opus API. It's mainly used to keep the API key on the server side and to handle the CORS constraints imposed by the browser.
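
To make the proxy idea concrete, here is a minimal sketch of what such an endpoint could look like. It is not the repository's server.js: the function names, the wildcard CORS origin and the ANTHROPIC_API_KEY environment variable are assumptions. It relies on the global fetch available in Node 18 and later, and on the headers the Anthropic Messages API expects (x-api-key and anthropic-version).

```javascript
// Sketch only: names are hypothetical and not taken from the repository.
const ANTHROPIC_URL = 'https://api.anthropic.com/v1/messages';

// CORS headers that let the browser app call this server from another origin.
function corsHeaders(origin = '*') {
  return {
    'Access-Control-Allow-Origin': origin,
    'Access-Control-Allow-Methods': 'POST, OPTIONS',
    'Access-Control-Allow-Headers': 'Content-Type',
  };
}

// Build the upstream request; the API key is added here, on the server only.
function buildUpstreamRequest(payload, apiKey) {
  return {
    method: 'POST',
    headers: {
      'content-type': 'application/json',
      'x-api-key': apiKey,
      'anthropic-version': '2023-06-01',
    },
    body: JSON.stringify(payload),
  };
}

// Handler with the (req, res) shape expected by Node's http.createServer.
// Assumption: the key is read from the ANTHROPIC_API_KEY environment variable.
async function handleProxy(req, res, apiKey = process.env.ANTHROPIC_API_KEY) {
  if (req.method === 'OPTIONS') {
    // Answer the browser's CORS preflight without contacting Anthropic.
    res.writeHead(204, corsHeaders());
    res.end();
    return;
  }
  if (req.method !== 'POST') {
    res.writeHead(405, corsHeaders());
    res.end();
    return;
  }

  // Collect the JSON body sent by the frontend.
  let raw = '';
  for await (const chunk of req) raw += chunk;

  let payload;
  try {
    payload = JSON.parse(raw);
  } catch {
    res.writeHead(400, corsHeaders());
    res.end();
    return;
  }

  try {
    // Forward the request and relay Anthropic's status and body unchanged.
    const upstream = await fetch(ANTHROPIC_URL, buildUpstreamRequest(payload, apiKey));
    const text = await upstream.text();
    res.writeHead(upstream.status, { ...corsHeaders(), 'Content-Type': 'application/json' });
    res.end(text);
  } catch (err) {
    res.writeHead(502, { ...corsHeaders(), 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ error: String(err) }));
  }
}
```

The handler could be served with `require('node:http').createServer(handleProxy).listen(3000)`, where the port is arbitrary. Because the key is attached inside buildUpstreamRequest, it never reaches the browser.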

The most important bit is really the prompt, which consists of 3 parts:

  • The task definition: this goes inside the system parameter when calling the Claude Opus API (read more here). The prompt is the following:
You are tasked with assisting in the completion of a PDF questionnaire using a provided JSON dataset.

The JSON data includes the following fields for each question in the form:
- id
- question
- isValidQuestion
- answer

Your specific duties include:

1. Question validation: Form the data for Processing you need to
    a. Analyse the "question" field
    b. Determine if the question is valid based on the "isValidQuestion" field
    c. If the question is valid, incorporate the corresponding answer provided in the answer field using the data provided by the user role

2. Strict Adherence to Data: Under no circumstances should you alter, rephrase, or modify any of the the question or id field, your main task is to populate the isValidQuestion and answer fields

3. Format Requirement: Return the results strictly in JSON format. Ensure that the output contains only the required information, maintaining the integrity and structure of the original JSON, including the id fields.\n

4. Valid Questions: Only return the questions that contain a valid question based on the "isValidQuestion" field but still conserving the original id

Important Note: Do not add extraneous text or information outside of the specified JSON structure.
  • List of fields in the PDF: this is a JSON structure with the list of fields in the PDF form, generated dynamically from the uploaded document. As the task definition above describes, Claude's job is to:
    • Check whether the question is valid, setting isValidQuestion to true or false.
    • Answer the question using the context given in the user role.
[
    {
        "id": 0,
        "question": "First name",
        "isValidQuestion": true,
        "answer": "John"
    },
    {
        "id": 1,
        "question": "Last name",
        "isValidQuestion": true,
        "answer": "Doe"
    }
]
  • User role: this provides the context, which is the text generated by the speech-to-text API. The prompt is defined as below, where the contextualText variable holds that text.
{
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "This is the contextual text that you need to use to complete the questionnaire\n\n ${contextualText}"
                }
            ]
        }
    ]
}
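
The three parts above could be combined into a single request along the following lines. This is a sketch under stated assumptions, not the code from the repository: the helper names, the max_tokens value and the choice to append the field list to the system prompt are all hypothetical, since the post does not say where the field list is placed. The model identifier and the shape of the response (a content array of text blocks) follow the Anthropic Messages API.

```javascript
// Sketch only: helper names and the placement of the field list are assumptions.
const MODEL = 'claude-3-opus-20240229';

// Turn the PDF's field names into the JSON structure described above.
function buildFieldList(fieldNames) {
  return fieldNames.map((name, id) => ({
    id,
    question: name,
    isValidQuestion: false, // placeholder; Claude is asked to set this
    answer: '',             // placeholder; Claude is asked to fill this in
  }));
}

// Assemble the request body from the task definition, fields and spoken text.
function buildPayload(taskDefinition, fields, contextualText) {
  return {
    model: MODEL,
    max_tokens: 1024, // assumed limit for the example
    // Assumption: the field list travels with the task definition.
    system: `${taskDefinition}\n\n${JSON.stringify(fields, null, 2)}`,
    messages: [
      {
        role: 'user',
        content: [
          {
            type: 'text',
            text: `This is the contextual text that you need to use to complete the questionnaire\n\n ${contextualText}`,
          },
        ],
      },
    ],
  };
}

// Read Claude's JSON reply and keep only the questions it marked as valid.
function extractAnswers(apiResponse) {
  const text = apiResponse.content
    .filter((block) => block.type === 'text')
    .map((block) => block.text)
    .join('');
  return JSON.parse(text).filter((item) => item.isValidQuestion === true);
}
```

For example, `buildPayload(taskDefinition, buildFieldList(['First name', 'Last name']), transcript)` produces the body to send through the proxy, and extractAnswers turns the reply into a list of id and answer pairs ready to be written back into the PDF. JSON.parse will throw if the model adds text outside the JSON, which is why the task definition insists on JSON-only output.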

And that's pretty much it. You can see how, with a very simple app, you can perform a reasonably advanced task that was unimaginable just a few years ago.

If you like this post, you might also like Exploring the GPT-4 with Vision API using Images and Videos. And if you are completely new to working with AI, I suggest you check out Getting started with OpenAI using Python in Windows and Getting started with Azure OpenAI.
