LLM Inference using 100% Modern Java ☕️🔥

Published at: 10/21/2024
Categories: java, llm, llama3
Author: stephanj

In the rapidly evolving world of (Gen)AI, Java developers now have powerful new (LLM Inference) tools at their disposal: Llama3.java and JLama.

These projects bring the capabilities of large language models (LLMs) to the Java ecosystem, giving developers a way to integrate advanced language processing into their applications.

Here's an example of Llama3.java providing inference for the DevoxxGenie IDEA plugin.

The JLama Project

JLama (a 100% Java inference engine) is developed by Jake Luciani and supports a whole range of LLMs:

  • Gemma & Gemma 2 Models
  • Llama & Llama2 & Llama3 Models
  • Mistral & Mixtral Models
  • Qwen2 Models
  • GPT-2 Models
  • BERT Models
  • BPE Tokenizers
  • WordPiece Tokenizers

Here's his Devoxx Belgium 2024 presentation with more information and demos.

From a features perspective, this is the most advanced Java implementation currently available. It even supports LLM sharding at the layer and attention-head level 🤩

Features include:
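To make the idea of layer-level sharding concrete, here is a minimal sketch that splits a model's transformer layers into contiguous ranges, one per worker. It is a generic illustration and not JLama's actual implementation or API: the class `LayerSharding`, the record `LayerRange` and the method `split` are hypothetical names, and the layer and worker counts are arbitrary.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of layer-level sharding; not taken from JLama.
public class LayerSharding {

    /** Half-open range of layers [start, end) assigned to one worker. */
    public record LayerRange(int worker, int start, int end) {}

    /** Splits numLayers into numWorkers contiguous, near-equal ranges. */
    public static List<LayerRange> split(int numLayers, int numWorkers) {
        if (numWorkers < 1 || numLayers < numWorkers) {
            throw new IllegalArgumentException("need 1 <= numWorkers <= numLayers");
        }
        List<LayerRange> ranges = new ArrayList<>();
        int base = numLayers / numWorkers;   // layers every worker gets
        int extra = numLayers % numWorkers;  // first 'extra' workers get one more
        int start = 0;
        for (int w = 0; w < numWorkers; w++) {
            int size = base + (w < extra ? 1 : 0);
            ranges.add(new LayerRange(w, start, start + size));
            start += size;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Arbitrary example values: 22 layers spread over 4 workers.
        for (LayerRange r : split(22, 4)) {
            System.out.println(r);
        }
    }
}
```

In a setup like this, each worker would only load the weights for its own range and pass activations on to the next worker. Sharding at the attention-head level would instead split the work inside each layer.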

  • Paged Attention
  • Mixture of Experts
  • Tool Calling
  • Generate Embeddings
  • Classifier Support
  • Huggingface SafeTensors model and tokenizer format
  • Support for F32, F16, BF16 types
  • Support for Q8, Q4 model quantization
  • Fast GEMM operations
  • Distributed Inference!

JLama requires Java 20 or later and utilises the new Vector API for faster inference.

You can easily run JLama on your computer. On Apple Silicon, make sure you are using an ARM-based SDK, for example:

export JAVA_HOME=/Library/Java/JavaVirtualMachines/liberica-jdk-21.jdk/Contents/Home

Now you can start the inference service by running JLama with the restapi command and the optional --auto-download flag.

jlama restapi tjake/TinyLlama-1.1B-Chat-v1.0-Jlama-Q4 --auto-download

This will download the model if you haven't already.

Experimental JLama and DevoxxGenie integration

Alina and Alfonso at Devoxx Belgium 2024

The Llama3.java Project

Llama3.java is also a 100% Java implementation, developed by Alfonso² Peterssen and inspired by Andrej Karpathy.

Features include:

  • Single file, no dependencies
  • GGUF format parser
  • Llama 3 tokenizer based on minbpe
  • Llama 3 inference with Grouped-Query Attention
  • Support for Llama 3.1 (ad-hoc RoPE scaling) and 3.2 (tied word embeddings)
  • Support for Q8_0 and Q4_0 quantizations
  • Fast matrix-vector multiplication routines for quantized tensors using Java's Vector API
  • Simple CLI with --chat and --instruct modes.
  • GraalVM's Native Image support (EA builds here)
  • AOT model pre-loading for instant time-to-first-token
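As background for the Q8_0 item in the list above, here is a minimal sketch of block-wise 8-bit quantization in plain Java. It follows the general Q8_0 idea (blocks of 32 values sharing one scale, each value stored as a signed byte), but it is not code from Llama3.java: the class `Q8Block` and its methods are hypothetical, and the scale is kept as a `float` here, whereas GGUF files store it as a 16-bit float.

```java
// Hypothetical sketch of Q8_0-style quantization; not taken from Llama3.java.
public class Q8Block {

    /** Number of values that share one scale factor. */
    public static final int BLOCK_SIZE = 32;

    /** One quantized block: a scale plus 32 signed 8-bit values. */
    public record Quantized(float scale, byte[] values) {}

    /** Quantizes 32 floats so that value[i] is approximately scale * q[i]. */
    public static Quantized quantize(float[] block) {
        if (block.length != BLOCK_SIZE) {
            throw new IllegalArgumentException("block must hold " + BLOCK_SIZE + " values");
        }
        float maxAbs = 0f;
        for (float v : block) {
            maxAbs = Math.max(maxAbs, Math.abs(v));
        }
        float scale = maxAbs / 127f;  // map the largest magnitude to +/-127
        byte[] q = new byte[BLOCK_SIZE];
        if (scale != 0f) {            // an all-zero block stays all zero
            for (int i = 0; i < BLOCK_SIZE; i++) {
                q[i] = (byte) Math.round(block[i] / scale);
            }
        }
        return new Quantized(scale, q);
    }

    /** Reconstructs approximate float values from a quantized block. */
    public static float[] dequantize(Quantized q) {
        float[] out = new float[BLOCK_SIZE];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            out[i] = q.values()[i] * q.scale();
        }
        return out;
    }

    /** Dot product of a quantized block with a float vector of the same length. */
    public static float dot(Quantized q, float[] x) {
        float sum = 0f;
        for (int i = 0; i < BLOCK_SIZE; i++) {
            sum += q.values()[i] * x[i];
        }
        return sum * q.scale();       // apply the shared scale once per block
    }
}
```

The `dot` method shows why this layout suits matrix-vector multiplication: the inner loop works on small integers and the scale is applied once per block. That inner loop is the kind of code the Vector API can process several lanes at a time.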

Here's the Devoxx Belgium 2024 presentation by Alfonso and Alina.

Llama3.java + (OpenAI) REST API

Llama3.java doesn't have a REST interface, so I decided to contribute that part ❤️

I've added a Spring Boot wrapper around the core Llama3.java library, allowing developers to easily set up and run an OpenAI-compatible REST API for text generation and chat completions. The goal is to use this as the 100% Java inference engine for the IDEA DevoxxGenie plugin, enabling local inference with a complete Java solution.

Code is available on GitHub

For the time being I've copied the Llama3.java source code into my project but ideally this should be integrated as a Maven dependency.

Key Features

  1. OpenAI-compatible API: The project implements an API that mimics OpenAI's chat completions endpoint, making it easy to integrate with existing applications.
  2. Support for GGUF Models: Llama3.java can work with GGUF (GPT-Generated Unified Format) models, which are optimised for efficiency and performance.
  3. Vector API Utilization: The project leverages Java's incubator Vector API for improved performance on matrix operations.
  4. Cross-Platform Compatibility: While optimized for Apple Silicon (M1/M2/M3), the project can run on various platforms with the appropriate Java SDK.

Getting Started

To get started with Llama3.java, follow these steps:

  1. Setup: Ensure you have a compatible Java SDK installed. For Apple Silicon users, an ARM-compliant SDK is recommended.
  2. Build: Use Maven to build the project with "mvn clean package".
  3. Download a Model: Obtain a GGUF model from the Hugging Face model hub and place it in the 'models' directory.
  4. Configure: Update the application.properties file with your model details and server settings.
  5. Run: Start the Spring Boot application using the provided Java command.
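Once the application is running, a client can talk to it the same way it would talk to any OpenAI-style chat completions endpoint. The sketch below uses only the JDK's built-in `java.net.http` client. The endpoint URL (`http://localhost:8080/chat/completions`), the model name `llama3` and the class `ChatClient` are assumptions for illustration; check the project's README and your application.properties for the actual host, port, path and model.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Hypothetical client sketch; endpoint and model name are assumptions.
public class ChatClient {

    /** Escapes characters that may not appear raw inside a JSON string. */
    public static String jsonEscape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            switch (c) {
                case '"' -> sb.append("\\\"");
                case '\\' -> sb.append("\\\\");
                case '\n' -> sb.append("\\n");
                case '\r' -> sb.append("\\r");
                case '\t' -> sb.append("\\t");
                default -> {
                    if (c < 0x20) {
                        sb.append(String.format("\\u%04x", (int) c));
                    } else {
                        sb.append(c);
                    }
                }
            }
        }
        return sb.toString();
    }

    /** Builds a minimal OpenAI-style chat completions body with one user message. */
    public static String chatBody(String model, String userMessage) {
        return "{\"model\":\"" + jsonEscape(model) + "\","
             + "\"messages\":[{\"role\":\"user\",\"content\":\""
             + jsonEscape(userMessage) + "\"}]}";
    }

    /** Builds the POST request without sending it. */
    public static HttpRequest chatRequest(String endpoint, String model, String userMessage) {
        return HttpRequest.newBuilder(URI.create(endpoint))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(chatBody(model, userMessage)))
                .build();
    }

    public static void main(String[] args) throws Exception {
        // Assumed default endpoint; pass the real one as the first argument.
        String endpoint = args.length > 0 ? args[0] : "http://localhost:8080/chat/completions";
        HttpRequest request = chatRequest(endpoint, "llama3", "Say hello in one sentence.");
        HttpResponse<String> response =
                HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

Building the JSON by hand keeps the example free of dependencies. In a real project a JSON library such as Jackson would be the safer choice.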

DevoxxGenie

When the Llama3.java Spring Boot application is running, you can use DevoxxGenie for local inference 🤩

Future Directions

The next step is to move the MatMul bottleneck to the GPU using TornadoVM. Also once GraalVM supports

  • Externalise Llama3.java as a Maven dependency (if/when available)
  • Add GPU support using TornadoVM
  • GraalVM native versions 🍏
  • LLM sharding capabilities
  • Support for different models: BitNets & Ternary Models

Conclusion

Llama3.java and JLama represent a significant step forward in bringing large language model capabilities to the Java ecosystem. By providing an easy-to-use, OpenAI-compatible API and leveraging Java's latest performance features, these projects open up new possibilities for AI-driven applications in Java.

Whether you're building a chatbot, a content generation tool, or any application that could benefit from advanced language processing, Llama3.java and JLama offer a promising solution.

As these projects continue to evolve and optimise, they are well worth keeping an eye on for Java developers interested in the cutting edge of AI technology.

Exciting times for Java Developers! ☕️🔥❤️

~ Stephan Janssen
