Mastering Real-Time AI: A Developer’s Guide to Building Streaming LLMs with FastAPI and Transformers

Published: 12/12/2024
Categories: langchain, fastapi, llm
Author: louis-sanna

Introduction: Why Real-Time Streaming AI is the Future

Real-time AI is transforming how users experience applications. Gone are the days when users had to wait for entire responses to load. Instead, modern apps stream data in chunks.

For developers, this shift isn't just a "nice-to-have" — it's essential. Chatbots, search engines, and AI-powered customer support apps are now expected to integrate streaming LLM (Large Language Model) responses. But how do you actually build one?

This guide walks you through the process, step-by-step, using FastAPI, Transformers, and a healthy dose of asynchronous programming. By the end, you'll have a working streaming endpoint capable of serving LLM-generated text in real-time.

💡 Who This Is For:

  • Software Engineers who want to upgrade their back-end skills with text streaming and event-driven programming.
  • Data Scientists who want to repurpose ML skills for production-ready AI services.

Table of Contents

  1. What Is a Streaming LLM and Why It Matters?
  2. Tech Stack Overview: The Tools You'll Need
  3. Project Walkthrough: Building the Streaming LLM Backend
    • Environment Setup
    • Setting Up FastAPI
    • Building the Streaming Endpoint
    • Connecting the LLM with Transformers
  4. Client-Side Integration: Consuming the Stream
  5. Deploying Your Streaming AI App
  6. Conclusion and Next Steps

1️⃣ What Is a Streaming LLM and Why It Matters?

When you type into ChatGPT or ask a question in Google Bard, you'll notice the response appears one word at a time. Streaming LLMs send chunks of text as they're generated instead of waiting for the entire message to finish, so the response is delivered in real time.

Here’s why you should care as a developer:

  • Faster User Feedback: Users see responses sooner.
  • Lower Latency Perception: Users feel like the system is faster, even if total time is the same.
  • Improved UX for AI Chatbots: Streaming text "feels" human, mimicking natural conversation.

If you’ve used ChatGPT, you’ve already experienced this. Now it’s time to learn how to build one yourself.


2️⃣ Tech Stack Overview: The Tools You'll Need

To build your streaming LLM backend, you’ll need the following tools:

📦 Core Technologies

  • FastAPI: Handles API requests and real-time streaming
  • Uvicorn: Runs the FastAPI app as an ASGI server
  • Transformers: Provides access to pre-trained language models
  • asyncio: Handles asynchronous event loops
  • contextvars: Keeps track of context across async tasks
  • Server-Sent Events (SSE): Streams messages to the client
  • Docker: Optional, for containerization and deployment

💡 Note: Server-Sent Events (SSE) is different from WebSockets. SSE lets the server push data to the client over a single long-lived HTTP response, while WebSockets support bi-directional communication. Since LLM streaming only flows from server to client, SSE is the simpler fit.
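
On the wire, an SSE stream is just plain text served with the text/event-stream content type: each event is a data: line followed by a blank line, which is exactly the shape every endpoint below yields. For example:

data: first chunk

data: second chunk

The blank line after each data: line tells the browser that one event is complete.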


3️⃣ Project Walkthrough: Building the Streaming LLM Backend

Step 1: Environment Setup

  1. Install Python and Pip: Ensure Python 3.8 or newer is installed (recent FastAPI and Transformers releases no longer support 3.7).
  2. Create a Virtual Environment:

    python -m venv venv
    source venv/bin/activate  # On Windows: venv\Scripts\activate
    
    
  3. Install Dependencies:

    pip install fastapi uvicorn transformers torch

    (asyncio and contextvars are part of Python's standard library, so they don't need to be installed separately; torch is the backend the Transformers text-generation pipeline runs on.)


Step 2: Set Up FastAPI

Create a file named app.py. Here’s the basic FastAPI setup.

from fastapi import FastAPI

app = FastAPI()

@app.get("/")
async def root():
    return {"message": "Welcome to Real-Time LLM Streaming!"}


Run the server:

uvicorn app:app --reload


Visit http://127.0.0.1:8000/ in your browser. You should see:

{ "message": "Welcome to Real-Time LLM Streaming!" }


Step 3: Build the Streaming Endpoint

Instead of returning a single response, we’ll stream it chunk-by-chunk. Here’s the idea:

  1. The client makes a request to /stream.
  2. The server "yields" parts of the response as they are generated.

Here’s the code for the streaming endpoint:

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
import asyncio

app = FastAPI()

async def event_stream():
    for i in range(10):
        await asyncio.sleep(1)  # Simulate response delay
        yield f"data: Message {i}\n\n"

@app.get("/stream")
async def stream_response():
    return StreamingResponse(event_stream(), media_type="text/event-stream")


🔥 Test It:

Run the server and visit http://127.0.0.1:8000/stream — you'll see "Message 0", "Message 1", etc., appear every second.
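
If your browser downloads the stream instead of rendering it, you can also watch the events arrive from a terminal by disabling curl's output buffering:

curl -N http://127.0.0.1:8000/stream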


Step 4: Connect the LLM with Transformers

Now, let’s swap out the dummy messages for LLM-generated responses.

from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from transformers import pipeline
import asyncio

app = FastAPI()
llm_pipeline = pipeline("text-generation", model="gpt2")

async def generate_response(prompt):
    # The pipeline call blocks until generation is finished and returns the full text,
    # so stream the result to the client word by word to get the live-typing effect.
    result = llm_pipeline(prompt, max_new_tokens=50, return_full_text=False)
    for word in result[0]["generated_text"].split():
        yield f"data: {word}\n\n"
        await asyncio.sleep(0.1)

@app.get("/stream")
async def stream_response(prompt: str):
    return StreamingResponse(generate_response(prompt), media_type="text/event-stream")


🔥 Test It:

Run the server and visit:

http://127.0.0.1:8000/stream?prompt=Once upon a time


You'll see the generated text appear in the browser word by word. Note that the gpt2 pipeline produces the whole completion before streaming begins; for genuine token-by-token streaming, see the sketch below.
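
If you want words to appear while the model is still generating, Transformers provides TextIteratorStreamer, which yields decoded tokens as generate() produces them in a background thread. Here is a minimal sketch, assuming a reasonably recent transformers release and the same gpt2 model; the blocking iteration is fine for a single-user demo:

from threading import Thread

from transformers import AutoModelForCausalLM, AutoTokenizer, TextIteratorStreamer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

async def generate_response(prompt: str):
    inputs = tokenizer(prompt, return_tensors="pt")
    streamer = TextIteratorStreamer(tokenizer, skip_prompt=True)
    # Run generation in a background thread so tokens can be read as they are produced.
    Thread(
        target=model.generate,
        kwargs={**inputs, "streamer": streamer, "max_new_tokens": 50},
    ).start()
    for token_text in streamer:
        # Newlines inside a data: payload would break SSE framing; flatten them for the demo.
        text = token_text.replace("\n", " ")
        yield f"data: {text}\n\n"

Drop this in place of the pipeline-based generate_response and the /stream endpoint stays unchanged.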


4️⃣ Client-Side Integration: Consuming the Stream

On the front end, you can use EventSource (a native browser API) to consume the stream.

Here’s the simplest way to do it:

<!DOCTYPE html>
<html lang="en">
<body>
  <h1>LLM Streaming Demo</h1>
  <pre id="stream-output"></pre>

  <script>
    const output = document.getElementById('stream-output');
    const eventSource = new EventSource('http://127.0.0.1:8000/stream?prompt=Tell me a story');

    eventSource.onmessage = (event) => {
      output.innerText += event.data + '\n';
    };
  </script>
</body>
</html>


This will display a live feed of the AI response on your webpage.
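
One gotcha when testing this page: if you open the HTML file straight from disk (or serve it from another port), the browser treats it as a different origin and will block the EventSource request unless the API allows it. A minimal sketch using FastAPI's CORSMiddleware, added to app.py (the wildcard origin is for local experimentation only):

from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],  # restrict this to your real front-end origin in production
    allow_methods=["GET"],
    allow_headers=["*"],
)

Also note that EventSource automatically reconnects when the server closes the stream, so the same response may start over; call eventSource.close() (for example in an onerror handler) once you've received what you need.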


5️⃣ Deploying Your Streaming AI App

You’ve got it working locally, but now you want to deploy it to the world. Here’s how:

Step 1: Dockerize the App

Create a file called Dockerfile:

FROM tiangolo/uvicorn-gunicorn-fastapi:python3.8

WORKDIR /app
COPY . /app

RUN pip install -r /app/requirements.txt

CMD ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "80"]

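The Dockerfile installs from a requirements.txt that we haven't written yet. A minimal one mirroring the dependencies installed earlier (unpinned here; pin versions for reproducible builds):

fastapi
uvicorn
transformers
torch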

Step 2: Build and Run the Docker Image

docker build -t streaming-llm .
docker run -p 80:80 streaming-llm

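With the container running and port 80 published, the streaming endpoint is available on your host at:

http://localhost/stream?prompt=Hello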

6️⃣ Conclusion: What’s Next?

Congratulations! 🎉 You’ve built a real-time LLM streaming service from scratch using FastAPI, Transformers, and Server-Sent Events. Here's what you’ve learned:

  • How streaming works (and why it matters).
  • How to use FastAPI for streaming endpoints.
  • How to stream LLM responses with Hugging Face Transformers.

Where to Go Next?

  • Optimize Your LLM: Swap gpt2 for a smaller model like DistilGPT2 for faster responses, or a larger one like GPT-J for better quality.
  • Explore WebSockets: For two-way streaming (not just server->client).
  • Deploy to Cloud: Deploy your app to AWS, GCP, or Heroku.

🧠 Pro Tip: Add an interactive client-side UI, like a chat interface, to create your own mini ChatGPT!

With this guide, you're ready to level up your developer skills and build interactive, AI-driven experiences. 🚀

Want to learn more about building Responsive LLMs? Check out my course on newline: Responsive LLM Applications with Server-Sent Events

I cover:

  • How to design systems for AI applications
  • How to stream the answer of a Large Language Model
  • Differences between Server-Sent Events and WebSockets
  • Importance of real-time for GenAI UI
  • How asynchronous programming in Python works
  • How to integrate LangChain with FastAPI
  • What problems Retrieval Augmented Generation can solve
  • How to create an AI agent ... and much more.