
How to Access Llama 3.3 70B Locally or via API: A Complete Guide

Published: 1/2/2025
Author: Novita AI

Key Highlights

1. Advanced Performance: Llama 3.3 70B is a powerful model from Meta. It excels in tasks such as instruction following and multilingual reasoning.

2. How to access Llama 3.3 70B locally: To run Llama 3.3 70B locally, you'll need a powerful GPU (minimum 24GB VRAM), at least 32GB of RAM, and 250GB of storage, along with specific software.

3. How to access Llama 3.3 70B via API: Novita AI offers an API for Llama 3.3 70B at just $0.39 per million tokens for both input and output. Just sign up for a free trial and use the API with simple requests.

4. Usage Recommendations: Different users have varying needs: researchers may prefer local installation, while businesses and casual users might find API access more convenient and cost-effective.

What is Llama 3.3 70B?

Llama 3.3 70B is Meta's latest multilingual large language model (LLM) designed for various text-based tasks. With 70 billion parameters, it offers performance comparable to the much larger Llama 3.1 405B model while significantly reducing computational requirements, making it more accessible for developers.

Key Features

  • Multilingual Support: Llama 3.3 70B natively supports eight languages: English, French, German, Hindi, Italian, Portuguese, Spanish, and Thai. It can also be fine-tuned for additional languages with proper safeguards.

  • Advanced Architecture: Utilizes an optimized transformer architecture with Grouped-Query Attention (GQA) to enhance efficiency and scalability (see the sketch after this list).

  • Long Context Length: Supports a context length of 128k tokens, suitable for processing lengthy texts.

  • Eco-Friendly Training: Meta achieved net-zero emissions during the model's training process.

  • Tool Integration: Allows integration with external tools and APIs for real-time data access and third-party applications.

  • Safety and Alignment: Fine-tuned with supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to ensure safety and alignment with human preferences.
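Since GQA is the key efficiency mechanism named above, a toy illustration may help: several query heads share one key/value head, which shrinks the key/value cache during inference. Below is a minimal PyTorch sketch using made-up toy dimensions, not the model's real configuration:

    import torch

    # Toy grouped-query attention: 8 query heads share 2 key/value heads.
    batch, seq, head_dim = 1, 8, 16
    n_q_heads, n_kv_heads = 8, 2
    group = n_q_heads // n_kv_heads          # 4 query heads per KV head

    q = torch.randn(batch, n_q_heads, seq, head_dim)
    k = torch.randn(batch, n_kv_heads, seq, head_dim)  # fewer KV heads -> smaller KV cache
    v = torch.randn(batch, n_kv_heads, seq, head_dim)

    # Expand each KV head so its group of query heads can attend to it.
    k = k.repeat_interleave(group, dim=1)    # (batch, n_q_heads, seq, head_dim)
    v = v.repeat_interleave(group, dim=1)

    attn = torch.softmax(q @ k.transpose(-2, -1) / head_dim**0.5, dim=-1)
    out = attn @ v                           # same output shape as standard multi-head attention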

https://youtu.be/-dnGa6Oms5I

Compared with Other Llama Models

  • Llama 3.3 70B vs. Llama 3.1 405B: Llama 3.3 70B offers similar performance to Llama 3.1 405B but with improved efficiency and lower computational demand.

  • Llama 3.3 70B vs. Llama 3.2: Llama 3.3 enhances fine-tuning, safety features, and benchmark performance over Llama 3.2.

Compared with Other Models

While Llama 3.3 70B may not always outperform models like GPT-4 or Claude 3.5, it delivers competitive results, particularly in coding and multilingual reasoning. It excels in instruction-following tasks, outperforming both Llama 3.1 405B and GPT-4 in this area. It is also more cost-effective than models such as Amazon Nova Pro, GPT-4, and Claude 3.5 in terms of input and output token costs.

If you want a more detailed parameter comparison, check out this article: Llama 3.3 Benchmark: Key Advantages and Application Insights.

Applications

  • Multilingual chatbots and virtual assistants.

  • Coding support and software development.

  • Synthetic data generation.

  • Multilingual content creation and localization.

  • Research and experimentation.

  • Knowledge-based applications such as question answering and summarization.

How to Access Llama 3.3 70B Locally


Hardware Requirements and Configuration Recommendations

  • GPU: NVIDIA GPU with a minimum of 24GB VRAM (e.g., A100 or H100). Some sources recommend an NVIDIA RTX A6000 with 48GB.

  • RAM: At least 32GB (64GB recommended for larger datasets).

  • Storage: Minimum 250GB of free disk space; the model itself may occupy around 40GB.

  • Operating System: Linux (preferred) or Windows with WSL2; Ubuntu 22.04 is a common choice.

  • Software: Python 3.8 or newer and CUDA Toolkit 11.7 or higher.

  • Required Libraries: Hugging Face Transformers, PyTorch, and tools for quantization and optimization like bitsandbytes.

The figures above make clear why Llama 3.3 70B's VRAM requirements are a challenge for home servers: the full-precision weights alone far exceed any single consumer GPU, so local setups depend on quantization and offloading.
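As a rough back-of-the-envelope check (simple arithmetic, independent of any particular library), weight memory is just parameter count times bytes per parameter:

    # Rough VRAM needed for the 70B weights alone, at different precisions.
    # Excludes the KV cache and activations, which add several more GB.
    params = 70e9

    for precision, bytes_per_param in [("bf16/fp16", 2), ("int8", 1), ("int4", 0.5)]:
        gib = params * bytes_per_param / 1024**3
        print(f"{precision}: ~{gib:.0f} GiB")
    # bf16/fp16: ~130 GiB, int8: ~65 GiB, int4: ~33 GiB.
    # Even 4-bit weights overflow a 24GB card, hence the need for offloading.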

Step-by-Step Installation Guide

  1. Install Python and create a virtual environment.

  2. Install the required libraries. Alongside bitsandbytes for quantization, you'll need PyTorch, Transformers, and Accelerate (which device_map="auto" relies on):

   pip install torch transformers accelerate bitsandbytes

  3. Install the Hugging Face CLI and log in (the huggingface-cli command ships with the huggingface_hub package):

   pip install -U "huggingface_hub[cli]"
   huggingface-cli login

  4. Request access to Llama 3.3 70B on the Hugging Face website.

  5. Download the model files using the Hugging Face CLI (the "original/*" pattern fetches Meta's original checkpoint files; the Transformers loader in the next step downloads the Hugging Face-format weights separately on first use):

   huggingface-cli download meta-llama/Llama-3.3-70B-Instruct --include "original/*" --local-dir Llama-3.3-70B-Instruct

  6. Load the model locally using the Hugging Face Transformers library:

   import torch
   from transformers import AutoModelForCausalLM, AutoTokenizer

   model_id = "meta-llama/Llama-3.3-70B-Instruct"
   # device_map="auto" spreads the weights across available GPUs (and CPU if needed).
   model = AutoModelForCausalLM.from_pretrained(
       model_id, device_map="auto", torch_dtype=torch.bfloat16
   )
   tokenizer = AutoTokenizer.from_pretrained(model_id)

  7. Run inference using the loaded model and tokenizer.
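For step 7, a minimal generation sketch (reusing the model and tokenizer from step 6; the prompt is only an illustration) could look like this:

   messages = [
       {"role": "user", "content": "Explain grouped-query attention in one sentence."},
   ]
   # Format the conversation with the model's chat template and move it to the model's device.
   inputs = tokenizer.apply_chat_template(
       messages, add_generation_prompt=True, return_tensors="pt"
   ).to(model.device)

   outputs = model.generate(inputs, max_new_tokens=128)
   # Decode only the newly generated tokens, skipping the prompt.
   print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))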

How to Access Llama 3.3 70B via Novita AI


Step-by-Step Guide

Novita AI offers an affordable, reliable, and simple inference platform with a scalable Llama 3.3 70B API, empowering developers to build AI applications. Try the Novita AI Llama 3.3 70B API Demo today!

Step 1: Log in and start your free trial!

You can find the LLM Playground page of Novita AI for a free trial. This is a test page provided specifically for developers. Select the model you want from the list; here, choose the Llama 3.3 70B model.


Step 2: If the trial goes well, you can start calling the API!

Click "API Key" in the menu. To authenticate with the API, you will be issued a new API key. On the "Keys" page, copy your API key.


Navigate to the API section and find "LLM" under the "Playground" tab. Install the Novita AI API client using the package manager for your programming language.
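For the Python example below, Novita AI exposes an OpenAI-compatible endpoint, so the standard OpenAI Python client is what gets installed (assuming Python as your language):

   pip install openai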


Step 3: Begin interacting with the model!

After installation, import the necessary libraries into your development environment and initialize the client with your API key to start interacting with the Novita AI LLM. Below is an example using the chat completions API.

from openai import OpenAI

client = OpenAI(
    base_url="https://api.novita.ai/v3/openai",
    # Get the Novita AI API Key by referring to: https://novita.ai/docs/get-started/quickstart.html#_2-manage-api-key.
    api_key="<YOUR Novita AI API Key>",
)

model = "meta-llama/llama-3.3-70b-instruct"
stream = True  # or False
max_tokens = 512

chat_completion_res = client.chat.completions.create(
    model=model,
    messages=[
        {
            "role": "system",
            "content": "Act like you are a helpful assistant.",
        },
        {
            "role": "user",
            "content": "Hi there!",
        }
    ],
    stream=stream,
    max_tokens=max_tokens,
)

if stream:
    for chunk in chat_completion_res:
        print(chunk.choices[0].delta.content or "")
else:
    print(chat_completion_res.choices[0].message.content)
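A note on the stream flag in this example: with stream=True the response arrives chunk by chunk as tokens are generated, which keeps perceived latency low for chat interfaces; with stream=False the call blocks until the full completion is ready, which is simpler for batch or backend use.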

Upon registration, Novita AI provides a $0.5 credit to get you started!

If the free credit is used up, you can pay to continue using the API.

Which Methods Are Suitable for You?

Comparison of Local vs. API Access

| Aspect | Local Access | API Access |
| --- | --- | --- |
| Scalability | Limited; requires manual upgrades. | Scales automatically and efficiently. |
| Flexibility | High; full control over settings. | Lower; depends on the provider's configurations. |
| Usability | Requires technical expertise. | Easier to use; no complex setup needed. |
| Affordability | High initial cost, low ongoing costs; best for long-term use. | Pay-per-use; ideal for small-scale or occasional use. |

Recommendations for Different User Groups

  • Researchers: Local access is generally preferred for flexibility and control over experiments.

  • Developers:

    • API access is suitable for building applications and rapid prototyping.
    • Local access is better for fine-tuning and custom workflows.
  • Businesses: API access is beneficial for quick integration into services without high upfront costs. Local deployment may suit teams with consistent requirements and the ability to invest in infrastructure.

  • Small Teams/Individuals: API access is generally more practical due to lower startup costs.

  • Users with Limited Technical Skills: API access is preferable as it eliminates the need for deep technical knowledge.

In conclusion, Llama 3.3 is a powerful, versatile, and accessible model that balances performance and resource requirements. Depending on your needs and available resources, you can choose to run it locally or access it via the API.

Frequently Asked Questions

1. Is Llama 3.3 70B free?

Llama 3.3 is considered free to use: it is released by Meta as an open-source model, meaning you can download and utilize it without any direct cost. However, depending on how you access it through third-party services, there may be associated fees.

2. What is the latest version of Llama?

The latest version is Llama 3.3, released in December 2024. Llama models are trained at different parameter sizes, ranging between 1B and 405B. Originally, Llama was only available as a foundation model.

Originally published by Novita AI.

Novita AI is the All-in-one cloud platform that empowers your AI ambitions. Integrated APIs, serverless, GPU Instance — the cost-effective tools you need. Eliminate infrastructure, start free, and make your AI vision a reality.
