
AI powered video summarizer with Amazon Bedrock and Anthropic’s Claude

Published at: 1/3/2024
Categories: bedrock, claude, generativeai, serverless
Author: zied

Photo by [Andy Benham](https://unsplash.com/@benham3160?utm_source=medium&utm_medium=referral) on [Unsplash](https://unsplash.com?utm_source=medium&utm_medium=referral)

At times, I find myself wanting to quickly get a summary of a video or capture the key points of a tech talk. Thanks to the capabilities of generative AI, achieving this is entirely possible with minimal effort.

In this article, I’ll walk you through the process of creating a service that summarizes YouTube videos based on their transcripts and generates audio from these summaries.

AI-powered YouTube video summarizer

We’ll leverage Anthropic’s Claude 2.1 foundation model through Amazon Bedrock for summary generation, and Amazon Polly to synthesize speech from these summaries.

Solution overview

I will use AWS Step Functions to orchestrate the different steps involved in the summary and audio generation:

AI-powered YouTube video summarizer architecture

🔍 Let’s break this down:

  • The Get Video Transcript function retrieves the transcript from a specified YouTube video URL. Upon successful retrieval, the transcript is stored in an S3 bucket, ready for processing in the next step.

  • The Generate Model Parameters function retrieves the transcript from the bucket and generates the prompt and inference parameters specific to Anthropic’s Claude v2 model. These parameters are then stored in the bucket for use by the Bedrock API in the subsequent step.

  • Invoking the Bedrock API is achieved through Step Functions’ AWS SDK integration, enabling model inference with the inputs stored in the bucket. This step generates a structured JSON containing the summary.

  • Generate audio from summary relies on Amazon Polly to perform speech synthesis from the summary produced in the previous step. This step returns the final output containing the video summary in text format, as well as a presigned URL for the generated audio file.

  • The bucket serves as state storage shared across all the steps of the state machine. We don’t know the size of the generated video transcript upfront; for lengthy videos it might exceed the Step Functions payload size limit of 256 KB, so states exchange object keys rather than content, as sketched below.
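To make this concrete, each state only needs to pass along the request id and the S3 object keys it produced, never the content itself. A minimal sketch of such a contract (the field names below are illustrative assumptions, not the exact shape used in the repository):

    // Illustrative shape of the payload exchanged between states:
    // the transcript, prompt and audio stay in S3, only their keys travel.
    type SummarizerState = {
      requestId: string;            // Step Functions execution name
      videoUrl: string;             // input YouTube video URL
      transcriptKey?: string;       // e.g. `${requestId}/transcript`
      modelParametersKey?: string;  // e.g. `${requestId}/model-parameters`
      audioKey?: string;            // key of the generated audio file
    };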

On using Anthropic’s Claude 2.1

At the time of writing, the Claude 2.1 model supports a 200K-token context window, roughly 150K words. It also provides good accuracy over long documents, making it well suited for summarizing lengthy video transcripts.

TL;DR

You will find the complete source code here 👇
GitHub - ziedbentahar/yt-video-summarizer-with-bedrock

I will use Node.js, TypeScript, and the AWS CDK for IaC.

Solution details

1- Enabling Anthropic’s Claude v2 in your account

Amazon Bedrock offers a range of foundation models, including Amazon Titan, Anthropic’s Claude, Meta’s Llama 2, and more, which are accessible through the Bedrock APIs. By default, these foundation models are not enabled; they must be enabled through the console before use.

We’ll request access to Anthropic’s Claude models. But first, we’ll need to submit use case details:

Request Anthropic’s Claude access

2- Getting transcripts from YouTube videos

I will rely on this lib for the video transcript extraction (it feels like a cheat code 😉); in fact, this library makes use of an unofficial YouTube API without relying on a headless Chrome solution. For now, it yields good results on several YouTube videos, but I might explore more robust solutions in the future:

The extracted transcript is then stored in the S3 bucket using ${requestId}/transcript as the key.

You can find the code for this lambda function here
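To give an idea of what this function does, here is a minimal sketch of the handler. It assumes the youtube-transcript npm package (the library is not named here in the article), a BUCKET_NAME environment variable, and a simplified event shape; the repository contains the actual implementation.

    import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
    import { YoutubeTranscript } from "youtube-transcript";

    const s3 = new S3Client({});

    // Simplified handler: fetch the transcript and store it under `${requestId}/transcript`.
    // BUCKET_NAME and the event shape are assumptions made for this sketch.
    export const handler = async (event: { videoUrl: string; requestId: string }) => {
      const segments = await YoutubeTranscript.fetchTranscript(event.videoUrl);
      const transcript = segments.map((s) => s.text).join(" ");

      const key = `${event.requestId}/transcript`;
      await s3.send(
        new PutObjectCommand({
          Bucket: process.env.BUCKET_NAME,
          Key: key,
          Body: transcript,
          ContentType: "text/plain",
        })
      );

      return { requestId: event.requestId, transcriptKey: key };
    };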

3- Finding an adequate prompt and generating model inference parameters

At the time of writing, Bedrock only supports Claude’s Text Completions API. Prompts must be wrapped in \n\nHuman: and \n\nAssistant: markers to let Claude understand the conversation context.

Here is the prompt; I find that it produces good results for our use case:

    You are a video transcript summarizer.
    Summarize this transcript in a third person point of view in 10 sentences.
    Identify the speakers and the main topics of the transcript and add them in the output as well.
    Do not add or invent speaker names if you are not able to identify them.
    Please output the summary in JSON format conforming to this JSON schema:
    {
      "type": "object",
      "properties": {
        "speakers": {
          "type": "array",
          "items": {
            "type": "string"
          }
        },
        "topics": {
          "type": "string"
        },
        "summary": {
          "type": "array",
          "items": {
            "type": "string"
          }
        }
      }
    }

    <transcript>{{transcript}}</transcript>

🤖 Helping Claude produce good results:

  • To clearly mark the transcript to summarize, we use XML tags. Claude will specifically focus on the content encapsulated by these XML tags. I will substitute the {{transcript}} placeholder with the actual video transcript.

  • To assist Claude in generating reliable JSON output, I include in the prompt the JSON schema that needs to be adhered to.

  • Finally, I also need to inform Claude that I want a concise JSON response without unnecessary chattiness, meaning without a preamble or postscript around the JSON payload:

\n\nHuman:{{prompt}}\n\nAssistant:{

Note that the full prompt ends with a trailing {, which puts the first character of the JSON response in Claude’s mouth.
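Putting these pieces together, the prompt assembly could look roughly like the following sketch (the helper name is mine; promptTemplate is the prompt shown above):

    // `promptTemplate` is the prompt shown above, ending with the
    // <transcript>{{transcript}}</transcript> tag.
    const buildPrompt = (promptTemplate: string, transcript: string): string => {
      const prompt = promptTemplate.replace("{{transcript}}", transcript);
      // Wrap with the Human/Assistant markers required by the Text Completions API
      // and end with an opening brace so Claude continues the JSON directly.
      return `\n\nHuman:${prompt}\n\nAssistant:{`;
    };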

As mentioned in the section above, we will store this generated prompt, as well as the model parameters, in the bucket so that they can be used as input to the Bedrock API:

      const modelParameters = {
        prompt,
        max_tokens_to_sample: MAX_TOKENS_TO_SAMPLE,
        top_k: 250,
        top_p: 1,
        temperature: 0.2,
        stop_sequences: ["Human:"],
        anthropic_version: "bedrock-2023-05-31",
      };

You can follow this link for the full code of the generate-model-parameters lambda function.

4- Invoking the Claude model

In this step, we’ll avoid writing a custom Lambda function to invoke the Bedrock API. Instead, we’ll use Step Functions’ direct SDK integration. This state loads from the bucket the model inference parameters that were generated in the previous step:
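As a rough sketch, this task can be expressed in CDK with a CustomState wrapping the Bedrock invokeModel integration, which can read the request body directly from S3. The bucket name, key layout, and state name below are assumptions, and depending on your setup the ModelId may need to be the full foundation model ARN:

    import * as sfn from "aws-cdk-lib/aws-stepfunctions";

    // Sketch only: the model parameters prepared in the previous step are read
    // straight from S3. Bucket name and key layout are assumptions.
    const invokeClaude = new sfn.CustomState(this, "Invoke Claude", {
      stateJson: {
        Type: "Task",
        Resource: "arn:aws:states:::bedrock:invokeModel",
        Parameters: {
          ModelId: "anthropic.claude-v2:1",
          Input: {
            "S3Uri.$":
              "States.Format('s3://<your-bucket-name>/{}/model-parameters', $$.Execution.Name)",
          },
        },
        // The ResultSelector with the intrinsic functions is shown below.
      },
    });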

☝️ Note: Since the prompt ends with an opening {, the completion in the API response is missing that leading {; Claude outputs only the rest of the requested JSON.

We use intrinsic functions in the state’s ResultSelector to add back the missing opening curly brace and to format the state output as a well-formed JSON payload:

    ResultSelector: {
      "id.$": "$$.Execution.Name",
      "summaryTaskResult.$":
        "States.StringToJson(States.Format('\\{{}', $.Body.completion))",
    }

I have to admit, it is not ideal, but it helps us get by without writing a custom Lambda function.

5- Generating audio from video summary

This step is heavily inspired by this previous blog post. Amazon Polly generates the audio from the video summary.

Here are the details of the synthesize function:
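The real code is in the repository; as a minimal sketch using the AWS SDK v3 Polly client, with an assumed voice, engine, and output format, it could look like this:

    import { PollyClient, SynthesizeSpeechCommand } from "@aws-sdk/client-polly";

    const polly = new PollyClient({});

    // Synthesize the summary text into MP3 audio bytes.
    // Voice, engine and language are assumptions for this sketch.
    const synthesize = async (text: string): Promise<Uint8Array> => {
      const { AudioStream } = await polly.send(
        new SynthesizeSpeechCommand({
          Text: text,
          OutputFormat: "mp3",
          VoiceId: "Joanna",
          Engine: "neural",
          LanguageCode: "en-US",
        })
      );

      if (!AudioStream) {
        throw new Error("No audio stream returned by Polly");
      }

      return AudioStream.transformToByteArray();
    };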

Once the audio is generated, we store it in the S3 bucket and generate a presigned URL so it can be downloaded afterwards.
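Storing the audio and producing the presigned URL might then look like the following sketch (bucket name, key layout, and expiry are assumptions):

    import { S3Client, PutObjectCommand, GetObjectCommand } from "@aws-sdk/client-s3";
    import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

    const s3 = new S3Client({});

    // Store the generated audio in the bucket and return a presigned download URL.
    // Bucket name, key layout and expiry are assumptions for this sketch.
    const storeAudioAndGetUrl = async (requestId: string, audio: Uint8Array) => {
      const Bucket = process.env.BUCKET_NAME!;
      const Key = `${requestId}/audio.mp3`;

      await s3.send(
        new PutObjectCommand({ Bucket, Key, Body: audio, ContentType: "audio/mpeg" })
      );

      return getSignedUrl(s3, new GetObjectCommand({ Bucket, Key }), {
        expiresIn: 3600, // one hour
      });
    };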

☝️ On language detection: In this example, I am not performing language detection; by default, I assume that the video is in English. You can find in my previous article how to handle language detection in speech synthesis. Alternatively, we can also leverage Claude’s capabilities to detect the language of the transcript.

6- Defining the state machine

Alright, let’s put it all together and take a look at the CDK definition of the state machine:
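The complete definition is in the repository; here is a condensed sketch of the wiring, assuming the Lambda constructs are defined elsewhere in the stack and reusing the invokeClaude custom state from the previous section:

    import * as lambda from "aws-cdk-lib/aws-lambda";
    import * as sfn from "aws-cdk-lib/aws-stepfunctions";
    import * as tasks from "aws-cdk-lib/aws-stepfunctions-tasks";

    // The Lambda constructs and the `invokeClaude` custom state are assumed to be
    // defined elsewhere in the stack (see the previous sections).
    declare const getVideoTranscriptFn: lambda.IFunction;
    declare const generateModelParametersFn: lambda.IFunction;
    declare const generateAudioFromSummaryFn: lambda.IFunction;
    declare const invokeClaude: sfn.CustomState;

    const getVideoTranscript = new tasks.LambdaInvoke(this, "Get Video Transcript", {
      lambdaFunction: getVideoTranscriptFn,
      outputPath: "$.Payload",
    });

    const generateModelParameters = new tasks.LambdaInvoke(this, "Generate Model Parameters", {
      lambdaFunction: generateModelParametersFn,
      outputPath: "$.Payload",
    });

    const generateAudioFromSummary = new tasks.LambdaInvoke(this, "Generate Audio From Summary", {
      lambdaFunction: generateAudioFromSummaryFn,
      outputPath: "$.Payload",
    });

    const stateMachine = new sfn.StateMachine(this, "VideoSummarizerStateMachine", {
      definitionBody: sfn.DefinitionBody.fromChainable(
        getVideoTranscript
          .next(generateModelParameters)
          .next(invokeClaude)
          .next(generateAudioFromSummary)
      ),
    });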

In order to invoke the Bedrock API, we’ll need to add this policy to the workflow’s role (and it’s important to remember to grant the state machine read and write permissions on the S3 bucket):
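With CDK, that could be expressed roughly as follows; scoping the policy to the Claude 2.1 foundation model ARN is an assumption, and stateMachine and bucket refer to the constructs from the sketches above:

    import * as iam from "aws-cdk-lib/aws-iam";
    import * as s3 from "aws-cdk-lib/aws-s3";

    // `bucket` is the S3 bucket used as state storage, defined elsewhere in the stack.
    declare const bucket: s3.IBucket;

    // Allow the state machine to invoke the Claude model on Bedrock...
    stateMachine.addToRolePolicy(
      new iam.PolicyStatement({
        actions: ["bedrock:InvokeModel"],
        resources: [
          `arn:aws:bedrock:${this.region}::foundation-model/anthropic.claude-v2:1`,
        ],
      })
    );

    // ...and grant it read & write access to the bucket used across the steps.
    bucket.grantReadWrite(stateMachine);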

Wrapping up

I find creating generative AI-based applications to be a fun exercise; I am always impressed by how quickly we can develop such applications by combining serverless and generative AI.

Certainly, there is room for improvement to make this solution production-grade. This workflow can be integrated into a larger process, allowing the video summary to be sent asynchronously to a client, and let’s not forget robust error handling.

Follow this link to get the source code for this article.

Thanks for reading and hope you enjoyed it!

Further readings

Put words in Claude's mouth
Anthropic Claude models
What is Amazon Bedrock?
