Intro to Llama on Graviton

Published at: 8/28/2024
Categories: llm, arm, tutorial, aws
Author: bluevalhalla

Note: Banner image generated by AI.

Are you curious about how to supercharge your application with AI while cutting costs? Discover how running Large Language Models on AWS Graviton can offer you the necessary performance at a fraction of the price.

It has been less than two years since ChatGPT changed the virtual face of AI. Since then, large language models (LLMs) have been all the rage. Adding a chatbot to your application may dramatically increase user interaction, but LLMs require complicated and costly infrastructure. Or do they?

After watching the “Generative AI Inference using AWS Graviton Processors” session from the AWS AI Infrastructure Day, I was inspired to share how you can run an LLM using the same Graviton processors as the rest of your application.

In this post, we will:

  • Set up a Graviton instance.
  • Follow the steps (with some modifications) in “Deploy a Large Language Model (LLM) chatbot on Arm servers” from the Arm Developer Hub to:
    • Download and compile llama.cpp
    • Download a Meta Llama 3.1 model using huggingface-cli
    • Re-quantize the model using llama-quantize to optimize it for the target Graviton platform
    • Run the model using llama-cli
    • Evaluate performance
  • Compare different instances of Graviton and discuss the pros and cons of each
  • Point to resources for getting started

Subsequent posts will dive deeper into application use cases, costs, and sustainability.

Set up a Graviton instance

First, let’s focus on the Graviton3-based r7g.16xlarge, a memory-optimized instance with 64 vCPUs. I’ll be running it in us-west-2. Using the console, navigate to EC2 Instances and select “Launch instances”. There are only a few fields necessary for a quick test:

  • Name: this is up to you; I have called mine ed-blog-r7g-16xl
  • Application and OS Images
    • AMI: I am using Ubuntu Server 24.04 LTS (the default if you select Ubuntu)
    • Architecture: Choose 64-bit (Arm)
  • Instance type: r7g.16xlarge
  • Key pair: Select an existing one or create a new one
  • Configure storage: I’m bumping this up to 32 GiB to make sure I have room for the code and Meta Llama models.

AWS Console EC2 Launch Settings

You can leave the defaults for the rest; just click “Launch instance” after reviewing the Summary.

AWS Console EC2 Launch Summary
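
If you prefer the AWS CLI over the console, a roughly equivalent launch looks like the sketch below. The AMI ID and key pair name are placeholders you would substitute for your own region and account (the root device name may also differ by AMI):

# Launch an r7g.16xlarge with a 32 GiB gp3 root volume (AMI ID and key name are placeholders)
aws ec2 run-instances \
  --image-id ami-xxxxxxxxxxxxxxxxx \
  --instance-type r7g.16xlarge \
  --key-name my-key-pair \
  --block-device-mappings '[{"DeviceName":"/dev/sda1","Ebs":{"VolumeSize":32,"VolumeType":"gp3"}}]' \
  --tag-specifications 'ResourceType=instance,Tags=[{Key=Name,Value=ed-blog-r7g-16xl}]'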

Once the instance has started, you can connect using your favorite method. For simplicity, I will use the EC2 Instance Connect method, which will provide a terminal in your browser window:

AWS Web Terminal
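
Once connected, a quick check confirms you are on a 64-bit Arm system before installing anything:

# Should print "aarch64" on a Graviton instance
uname -m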

Build and Run Meta Llama 3.1

To build and run Meta Llama 3.1, we will follow the steps (with some modifications) in “Deploy a Large Language Model (LLM) chatbot on Arm servers” from the Arm Developer Hub to:

Download and compile llama.cpp

First, we install any prerequisites:

sudo apt update
sudo apt install make cmake -y
sudo apt install gcc g++ -y
sudo apt install build-essential -y

Then we clone llama.cpp and build it (the -j$(nproc) flag will use all available vCPU cores to speed up compilation):

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make -j$(nproc)

Finally, we can test it using the help flag:

./llama-cli -h

Download Meta Llama 3.1

Next, we’ll set up a virtual environment for Python packages:

sudo apt install python-is-python3 python3-pip python3-venv -y
python -m venv venv
source venv/bin/activate

Now install HuggingFace Hub and use it to download a 4-bit quantized version of Meta Llama 3.1:

pip install huggingface_hub
huggingface-cli download cognitivecomputations/dolphin-2.9.4-llama3.1-8b-gguf dolphin-2.9.4-llama3.1-8b-Q4_0.gguf --local-dir . --local-dir-use-symlinks False
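
As a quick sanity check that the download completed, list the file; the 8B Q4_0 GGUF should be a few gigabytes:

# Verify the quantized model file is present and roughly the expected size
ls -lh dolphin-2.9.4-llama3.1-8b-Q4_0.gguf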

Re-quantize the model

The model we downloaded is already 4-bit quantized (half-byte per weight). This gives us a 4x improvement in model size compared with the original bfloat16 (2-byte per weight). However, the width of the Scalable Vector Extension (SVE) is different for Graviton3 (2x256-bit SVE) and Graviton4 (4x128-bit SVE2). Graviton2 does not have SVE but will use 2x128-bit Arm Neon technology. To maximize the throughput for each generation, you should re-quantize the model with the following block layouts:

  • Graviton2: 4x4 (Q4_0_4_4)
  • Graviton3: 8x8 (Q4_0_8_8)
  • Graviton4: 4x8 (Q4_0_4_8)

For the Graviton3 instance, we will re-quantize the model using llama-quantize as follows:

./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf Q4_0_8_8

Run the model

Finally, we can run the model using llama-cli. There are a few arguments we will use:

  • Model (-m): The optimized model for Graviton3, dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf
  • Prompt (-p): As a test prompt, we’ll use “Building a visually appealing website can be done in ten simple steps”
  • Response length (-n): We’ll ask for 512 tokens
  • Thread count (-t): We want to use all 64 of the vCPUs

Here’s the command:

./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64

When you run the command, you should see several parameters print out followed by the generated text (starting with the prompt) and finally performance statistics:

llama-cli Terminal Output

Evaluate Performance

The two lines highlighted above are the prompt evaluation time and the text generation time. These are two of the key metrics for user experience with LLMs. The prompt evaluation time relates to how long it takes the LLM to process the prompt and start to respond. The text generation time is how long it takes to generate the output. In both cases, the metric can be viewed in terms of tokens per second (T/s). For our run we see:

Evaluation: 278.2 T/s
Generation: 47.7 T/s
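
As a rough, back-of-the-envelope check of what that means for latency (this is not llama-cli output), dividing the requested response length by the generation rate gives the time to produce the full 512-token response:

# ~10.7 seconds to generate 512 tokens at 47.7 tokens/second
awk 'BEGIN { printf "%.1f\n", 512 / 47.7 }'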

If you run the standard Q4_0 quantization with everything else the same, as with this command:

./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 64

You will see a decrease in performance:

Evaluation: 164.6 T/s
Generation: 28.1 T/s

Using the correct quantization format (Q4_0_8_8, in this case), you get close to a 70% improvement in both evaluation and generation!
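
Those ratios are easy to verify from the numbers above:

# Q4_0_8_8 vs Q4_0: ~1.69x on prompt evaluation, ~1.70x on text generation
awk 'BEGIN { printf "eval: %.2fx  gen: %.2fx\n", 278.2 / 164.6, 47.7 / 28.1 }'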

When you are done with your tests, don’t forget to stop the instance!
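
If you prefer the CLI for cleanup too, stopping the instance is a single command (the instance ID below is a placeholder for your own):

# Stop the instance so it no longer accrues compute charges
aws ec2 stop-instances --instance-ids i-0123456789abcdef0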

Comparing Graviton-based instances

Using the process above, we can run the same model on similarly equipped Graviton2 and Graviton4-based instances. Using the optimum quantization format for each, we can see an increase in performance from generation to generation:

Generation   Instance       Quant      Eval (T/s)   Gen (T/s)
Graviton2    r6g.16xlarge   Q4_0_4_4   175.4        25.1
Graviton3    r7g.16xlarge   Q4_0_8_8   278.2        42.7
Graviton4    r8g.16xlarge   Q4_0_4_8   341.8        65.6
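
The re-quantization step is the only part of the process that changes per generation; for example (output filenames are just illustrative):

# Graviton2 (Neon): 4x4 block layout
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_4.gguf Q4_0_4_4
# Graviton4 (SVE2): 4x8 block layout
./llama-quantize --allow-requantize dolphin-2.9.4-llama3.1-8b-Q4_0.gguf dolphin-2.9.4-llama3.1-8b-Q4_0_4_8.gguf Q4_0_4_8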

The performance differences are due to vectorization extensions, caching, clock speed, and memory bandwidth. You may see some variation at lower vCPU/thread counts and when using different instance types: general purpose (M), compute optimized (C), etc. Graviton4 also has more cores per chip, with instances available up to 192 vCPUs!

Determining which instances meet your needs depends on your application. For interactive applications, you may want low evaluation latency and a text generation speed of more than 10 T/s. Any of the 64 vCPU instances can easily meet the generation requirement, but you may need to consider the expected size of prompts to determine evaluation latency. Graviton2 performance suggests that serverless solutions using AWS Lambda may be possible, especially for non-time-critical applications.
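
If you want to see how throughput scales with thread count on your own instance, rerun the same command with a smaller -t value, for example:

# Same model and prompt, limited to 16 threads
./llama-cli -m dolphin-2.9.4-llama3.1-8b-Q4_0_8_8.gguf -p "Building a visually appealing website can be done in ten simple steps:" -n 512 -t 16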

Get Started!

As you can see, running Meta Llama models on AWS Graviton is straightforward. This is an easy way to test out models for your own applications. In many cases, Graviton may be the most cost-effective way of integrating LLMs with your application. I’ll explore this further in the coming months.

In the meantime, here are some resources to help you get started:

  • “Deploy a Large Language Model (LLM) chatbot on Arm servers” on the Arm Developer Hub
  • The “Generative AI Inference using AWS Graviton Processors” session from AWS AI Infrastructure Day

Have fun!
