The Era of LLM Infrastructure

Published at: 2/15/2024
Categories: chatgpt, genai, llms, llmops
Author: roma_glushko

API access to large language models has opened up a world of opportunities. Many simple proof-of-concept applications have shown promise. However, as the complexity of these applications grows, several crucial issues arise when putting them into production: unreliable API endpoints, slow token generation, LLM lock-in, and cost management. Clearly, the LLM era will require solutions for managing LLM API endpoints.

Glide is a cloud-native LLM gateway that provides a lightweight interface to manage the complexity of working with multiple LLM providers.

The New Way of Building GenAI Apps

Unified API

Glide offers a comprehensive API for interacting with multiple LLM providers. Instead of dedicating considerable time and resources to building custom integrations for each provider, users work with a single API interface that can talk to any supported LLM provider. Working against one standardized API reduces complexity and development time, leading to faster application development. It also means zero LLM lock-in: the underlying models can be swapped without any changes to the client application.
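
To illustrate, here is a minimal client sketch in Python. The local address, the endpoint path, and the request shape are assumptions based on a default local Glide setup, and my-chat-app refers to the router configured later in this post; treat this as a sketch rather than reference usage:

import requests

# Assumed local Glide instance; the host/port and the endpoint path
# are assumptions - check your deployment and your Glide version's docs.
GLIDE_URL = "http://127.0.0.1:9099/v1/language/my-chat-app/chat"

payload = {
    "message": {
        "role": "user",
        "content": "Summarize what an LLM gateway does.",
    }
}

# One request shape, regardless of which provider ends up serving it.
response = requests.post(GLIDE_URL, json=payload, timeout=30)
response.raise_for_status()

# The response is expected to identify which model actually handled the
# call, so fallbacks stay observable without client-side changes.
print(response.json())

If one provider is down, the same call succeeds against the fallback model; the client code never changes.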

Glide Routers

A fundamental concept in Glide is the router. Routers let you group models together under shared logic. Consider a RAG-powered chatbot that lets users search over a documentation set. If it is built directly on GPT-3.5 Turbo, it depends entirely on OpenAI keeping its API operational, and that dependency poses a significant risk to the application and the user experience. Instead, you can set up a Glide router in resilience mode by adding a single backup model to the router; the bare-bones configuration at the end of this post shows exactly this setup. If the OpenAI API fails, Glide automatically sends the call to the next model specified in the configuration. In addition, knowledge of model failures is shared across all routers, which reduces wasteful retries when an LLM provider has a known issue.

Another essential router type is the least-latency router, which selects the model with the lowest average latency per generated token. Since the true distribution of model latencies is unknown, Glide estimates it and keeps the estimate fresh: older latency measurements are weighted lower over time and eventually dropped from the calculation, so the estimate tracks current provider performance. As with all routers, if a model becomes unhealthy, the router picks the next-best one, and so on.
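
As a sketch, a least-latency router is configured much like the priority example shown later in this post; the strategy identifier least_latency is an assumption, so check the Glide docs for the exact name your version expects:

routers:
  language:
    - id: my-latency-sensitive-app
      strategy: least_latency # assumed identifier; verify against the Glide docs
      models:
        - id: openai
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}
        - id: azure
          azureopenai:
            api_key: ${env:AZUREOAI_API_KEY}
            model: "glide-GPT-35" # the Azure OpenAI deployment name
            base_url: "https://mydeployment.openai.azure.com/"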

Other routing modes are available, such as round-robin, which is excellent for A/B testing, and weighted round-robin, which lets you specify what share of traffic each model should receive.
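
For instance, a weighted round-robin router splitting traffic 90/10 between two models might look like the following sketch; both the strategy identifier and the per-model weight field are assumptions here, so verify them against the Glide docs:

routers:
  language:
    - id: my-ab-test
      strategy: weighted_round_robin # assumed identifier
      models:
        - id: variant-a
          weight: 90 # assumed field name; ~90% of traffic
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}
        - id: variant-b
          weight: 10 # ~10% of traffic
          openai:
            model: "gpt-4"
            api_key: ${env:OPENAI_API_KEY}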

One Glide deployment can support multiple applications with diverse requirements, since a single deployment can host numerous routers. There are also exciting routers on the roadmap, such as intelligent routing, which would send each request to the model best suited for it.
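
For example, a single configuration could declare a resilience-focused router and a latency-focused router side by side (again using the assumed least_latency identifier):

routers:
  language:
    - id: support-chatbot # resilience-first application
      strategy: priority
      models:
        - id: primary
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}
        - id: secondary
          azureopenai:
            api_key: ${env:AZUREOAI_API_KEY}
            model: "glide-GPT-35" # the Azure OpenAI deployment name
            base_url: "https://mydeployment.openai.azure.com/"
    - id: autocomplete # latency-first application
      strategy: least_latency # assumed identifier
      models:
        - id: primary
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}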

Declarative Configuration

Glide simplifies the setup process through declarative configuration, which defines the state of the Glide gateway in one place. This also means that secret management is centralized, enabling the rotation of API keys from a single location.

Furthermore, this approach enables a separation of responsibilities between teams. One team can manage the infrastructure, deploy Glide, make it available to other teams (such as AI/DS teams), and take responsibility for rotating keys. Meanwhile, the other teams can focus solely on working with models without worrying about these configurations.

Here is a bare-bones configuration example:

routers:
  language:
    - id: my-chat-app
      strategy: priority
      models:
        - id: primary
          openai:
            model: "gpt-3.5-turbo"
            api_key: ${env:OPENAI_API_KEY}
        - id: secondary
          azureopenai:
            api_key: ${env:AZUREOAI_API_KEY}
            model: "glide-GPT-35" # the Azure OpenAI deployment name
            base_url: "https://mydeployment.openai.azure.com/"

With this simple configuration, a priority/fallback router has been created. All requests are sent to OpenAI first; should the OpenAI API fail, the request is routed to the Azure OpenAI deployment instead.

What's Next?

The future of LLM applications will be multi-modal, with text, speech, and vision models employed together to create rich user experiences. Glide will be the go-to gateway for these applications. Glide plans to support various features over the next several months, including exact and semantic caching, embedding endpoints, speech endpoints, safety policies, and monitoring features.


If you are interested in using Glide, here is a list of links for you to check out:
