Intro to Llama on Graviton

Published at: 8/28/2024
Categories: llm, arm, tutorial, aws
Author: bluevalhalla

Note: Banner image generated by AI.

Are you curious about how to supercharge your application with AI while cutting costs? Discover how running Large Language Models on AWS Graviton can offer you the necessary performance at a fraction of the price.

It has been less than two years since ChatGPT changed the virtual face of AI. Since then, large language models (LLMs) have been all the rage. Adding a chatbot to your application may dramatically increase user interaction, but LLMs require complicated and costly infrastructure. Or do they?

After watching the “Generative AI Inference using AWS Graviton Processors” session from the AWS AI Infrastructure Day, I was inspired to share how you can run an LLM using the same Graviton processors as the rest of your application.

In this post, we will:

  • Set up a Graviton instance.
  • Follow the steps (with some modifications) in “Deploy a Large Language Model (LLM) chatbot on Arm servers” from the Arm Developer Hub to:
    • Download and compile llama.cpp
    • Download a Meta Llama 3.1 model using huggingface-cli
    • Re-quantize the model using llama-quantize to optimize it for the target Graviton platform
    • Run the model using llama-cli
    • Evaluate performance
  • Compare different instances of Graviton and discuss the pros and cons of each
  • Point to resources for getting started

Subsequent posts will dive deeper into application use cases, costs, and sustainability.

Set up a Graviton instance

First, let’s focus on the Graviton3-based r7g.16xlarge, a memory-optimized instance with 64 vCPUs. I’ll be running it in us-west-2. Using the console, navigate to EC2 Instances and select “Launch instances”. There are only a few fields necessary for a quick test:

  • Name: this is up to you; I have called mine ed-blog-r7g-16xl
  • Application and OS Images
    • AMI: I am using Ubuntu Server 24.04 LTS (the default if you select Ubuntu)
    • Architecture: Choose 64-bit (Arm)
  • Instance type: r7g.16xlarge
  • Key pair: Select an existing one or create a new one
  • Configure storage: I’m bumping this up to 32 GiB to make sure I have room for the code and Meta Llama models.

AWS Console EC2 Launch Settings

You can leave the defaults for the rest; just click “Launch instance” after reviewing the Summary.

AWS Console EC2 Launch Summary
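
If you prefer the AWS CLI over the console, a roughly equivalent launch looks like the sketch below. The AMI ID and key pair name are placeholders you would substitute for your own region and account (the root device name may also differ by AMI):

# Launch an r7g.16xlarge with a 32 GiB gp3 root volume (AMI ID and key name are placeholders)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type r7g.16xlarge \
  --key-name my-key-pair \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":32,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ed-blog-r7g-16xl}]'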

Once the instance has started, you can connect using your favorite method. For simplicity, I will use the EC2 Instance Connect method, which will provide a terminal in your browser window:

AWS Web Terminal
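
Once connected, a quick check confirms you are on a 64-bit Arm system before installing anything:

# Should print "aarch64" on a Graviton instance
uname -m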

Build and Run Meta Llama 3.1

To build and run Meta Llama 3.1, we will follow the steps (with some modifications) in “Deploy a Large Language Model (LLM) chatbot on Arm servers” from the Arm Developer Hub to:

Download and compile llama.cpp

First, we install any prerequisites:

sudo apt update
sudo apt install make cmake -y
sudo apt install gcc g++ -y
sudo apt install build-essential -y

Then we clone llama.cpp and build it (the -j$(nproc) flag will use all available vCPU cores to speed up compilation):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

Finally, we can test it using the help flag:

./llama-cli -h

Download Meta Llama 3.1

Next, we’ll set up a virtual environment for Python packages:

sudo apt install python-is-python3 python3-pip python3-venv -y
python -m venv venv
source venv/bin/activate

Now install HuggingFace Hub and use it to download a 4-bit quantized version of Meta Llama 3.1:

pip install huggingface_hub
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
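
As a quick sanity check that the download completed, list the file; the 8B Q4_0 GGUF should be a few gigabytes:

# Verify the quantized model file is present and roughly the expected size
ls -lh dolphin-2.9.4-llama3.1-8b-Q4_0.gguf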

Re-quantize the model

The model we downloaded is already 4-bit quantized (half-byte per weight). This gives us a 4x improvement in model size compared with the original bfloat16 (2-byte per weight). However, the width of the Scalable Vector Extension (SVE) is different for Graviton3 (2x256-bit SVE) and Graviton4 (4x128-bit SVE2). Graviton2 does not have SVE but will use 2x128-bit Arm Neon technology. To maximize the throughput for each generation, you should re-quantize the model with the following block layouts:

  • Graviton2: 4x4 (Q4_0_4_4)
  • Graviton3: 8x8 (Q4_0_8_8)
  • Graviton4: 4x8 (Q4_0_4_8)

For the Graviton3 instance, we will re-quantize the model using llama-quantize as follows:

./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8

Run the model

Finally, we can run the model using llama-cli. There are a few arguments we will use:

  • Model (-m): The optimized model for Graviton3, dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf
  • Prompt (-p): As a test prompt, we’ll use “Building a visually appealing website can be done in ten simple steps”
  • Response length (-n): We’ll ask for 512 tokens
  • Thread count (-t): We want to use all 64 of the vCPUs

Here’s the command:

./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64

When you run the command, you should see several parameters print out followed by the generated text (starting with the prompt) and finally performance statistics:

llama-cli Terminal Output

Evaluate Performance

The two lines highlighted above are the prompt evaluation time and the text generation time. These are two of the key metrics for user experience with LLMs. The prompt evaluation time relates to how long it takes the LLM to process the prompt and start to respond. The text generation time is how long it takes to generate the output. In both cases, the metric can be viewed in terms of tokens per second (T/s). For our run we see:

Evaluation: 278.2 T/s
Generation: 47.7 T/s
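
As a rough, back-of-the-envelope check of what that means for latency (this is not llama-cli output), dividing the requested response length by the generation rate gives the time to produce the full 512-token response:

# ~10.7 seconds to generate 512 tokens at 47.7 tokens/second
awk 'BEGIN { printf "%.1f\n", 512 / 47.7 }'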

If you run the standard Q4_0 quantization with everything else the same, as with this command:

./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64

You will see a decrease in performance:

Evaluation: 164.6 T/s
Generation: 28.1 T/s

Using the correct quantization format (Q4_0_8_8, in this case), you get close to a 70% improvement in both evaluation and generation!
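
Those ratios are easy to verify from the numbers above:

# Q4_0_8_8 vs Q4_0: ~1.69x on prompt evaluation, ~1.70x on text generation
awk 'BEGIN { printf "eval: %.2fx  gen: %.2fx\n", 278.2 / 164.6, 47.7 / 28.1 }'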

When you are done with your tests, don’t forget to stop the instance!
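
If you prefer the CLI for cleanup too, stopping the instance is a single command (the instance ID below is a placeholder for your own):

# Stop the instance so it no longer accrues compute charges
aws ec2 stop-instances --instance-ids i-0123456789abcdef0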

Comparing Graviton-based instances

Using the process above, we can run the same model on similarly equipped Graviton2 and Graviton4-based instances. Using the optimum quantization format for each, we can see an increase in performance from generation to generation:

Generation   Instance       Quant      Eval (T/s)   Gen (T/s)
Graviton2    r6g.16xlarge   Q4_0_4_4   175.4        25.1
Graviton3    r7g.16xlarge   Q4_0_8_8   278.2        42.7
Graviton4    r8g.16xlarge   Q4_0_4_8   341.8        65.6
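
The re-quantization step is the only part of the process that changes per generation; for example (output filenames are just illustrative):

# Graviton2 (Neon): 4x4 block layout
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_4.gguf Q4_0_4_4
# Graviton4 (SVE2): 4x8 block layout
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_8.gguf Q4_0_4_8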

The performance differences are due to vectorization extensions, caching, clock speed, and memory bandwidth. You may see some variation at lower vCPU/thread counts and when using different instance types: general purpose (M), compute optimized (C), etc. Graviton4 also has more cores per chip, with instances available up to 192 vCPUs!

Determining which instances meet your needs depends on your application. For interactive applications, you may want low evaluation latency and a text generation speed of more than 10 T/s. Any of the 64 vCPU instances can easily meet the generation requirement, but you may need to consider the expected size of prompts to determine evaluation latency. Graviton2 performance suggests that serverless solutions using AWS Lambda may be possible, especially for non-time-critical applications.
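
If you want to see how throughput scales with thread count on your own instance, rerun the same command with a smaller -t value, for example:

# Same model and prompt, limited to 16 threads
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 16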

Get Started!

As you can see, running Meta Llama models on AWS Graviton is straightforward. This is an easy way to test out models for your own applications. In many cases, Graviton may be the most cost-effective way of integrating LLMs with your application. I’ll explore this further in the coming months.

In the meantime, here are some resources to help you get started:

  • “Deploy a Large Language Model (LLM) chatbot on Arm servers” on the Arm Developer Hub
  • The “Generative AI Inference using AWS Graviton Processors” session from AWS AI Infrastructure Day

Have fun!
