A Practical Guide to Reducing LLM Hallucinations with Sandboxed Code Interpreter

Published: 12/21/2024
Categories: ai, langchain, sandbox, openai
Author: dbolotov

Most LLMs and SLMs are not designed for calculations (setting aside reasoning models like OpenAI's o1 or o3). Just imagine the following dialogue:

  • Company: Today is Wednesday; you can return the delivery parcel within 24 hours.
  • Client: Okay, let's do it on Tuesday.

Are you sure the next AI response will be correct? As a human, you can understand that next Tuesday is six days ahead, while 24 hours is just one day. However, most LLMs cannot reliably handle such logic. Their responses are non-deterministic.

This issue worsens as the context grows. If you have 30 rules and a conversation history of 30 messages, the AI loses focus and makes mistakes easily.
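This kind of calendar arithmetic is trivial for ordinary code, which gets it right every time. A minimal TypeScript sketch (the dates are chosen to match the dialogue above):

// Deterministic date arithmetic: days from Wednesday to the next Tuesday.
const TUESDAY = 2; // JS Date: Sunday = 0, Monday = 1, ... Saturday = 6
const today = new Date(2024, 11, 18); // a Wednesday, as in the dialogue
const daysAhead = (TUESDAY - today.getDay() + 7) % 7 || 7; // -> 6
console.log(daysAhead <= 1 ? "Within 24 hours" : `Too late: ${daysAhead} days away`);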

Common Use-Case

  • You're developing an AI scheduling chatbot or AI agent for your company.
  • The company has scheduling rules that are frequently updated.
  • Before scheduling, the chatbot must validate customer input parameters.
  • If validation fails, the chatbot must inform the customer.

What Can We Do?

Combine traditional code execution with LLMs. This idea is not new but remains underutilized:

  • OpenAI integrates this feature into its Assistants API, but not into the Completions API.
  • Google recently introduced code interpreter capabilities in Gemini 2.0 Flash.


Our Solution Tech Stack

  • Docker (Podman)
  • LangGraph.js
  • Piston

Code Interpreter Sandbox

To securely run generated code, the most popular cloud code interpreters are e2b and, as mentioned above, the offerings from Google and OpenAI.

However, I was looking for an open-source, self-hosted solution for flexibility and cost-effectiveness. That left two good options:

  • Piston
  • Jupyter

I chose Piston for its ease of deployment.


Piston Installation

It took me a while to figure out how to add a Python execution environment to Piston.

0. Enable cgroup v2

For Windows WSL, this article was helpful.

1. Run a Container

docker run --privileged -p 2000:2000 -v d:\piston:'/piston' --name piston_api ghcr.io/engineer-man/piston

2. Check Out the Piston Repository

git clone https://github.com/engineer-man/piston

3. Add Python Support

Run the following command:

node cli/index.js ppman install python

By default, this command talks to your container's API on localhost:2000 to install Python.
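You can confirm the runtime is available by listing the server's installed runtimes (a quick check against Piston's public REST endpoint):

// Ask the Piston server which runtimes are installed.
const res = await fetch("http://localhost:2000/api/v2/runtimes");
const runtimes: { language: string; version: string }[] = await res.json();
console.log(runtimes.filter((r) => r.language === "python")); // expect one entry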

Example Code Execution

Using the Piston Node.js Client:

import piston from "piston-client";

// Point the client at the self-hosted Piston container from step 1.
const codeInterpreter = piston({ server: "http://localhost:2000" });

// Run a snippet in the sandbox; the language must match an installed runtime.
const result = await codeInterpreter.execute('python', 'print("Hello World!")');

console.log(result);
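For reference, a successful run resolves to an object roughly shaped like this (my reading of the Piston v2 API response; exact fields may vary by version):

// Approximate shape of the resolved result (assumption: Piston v2 API):
// {
//   language: "python",
//   version: "3.x",
//   run: { stdout: "Hello World!\n", stderr: "", code: 0, output: "Hello World!\n" }
// }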

AI Agents Implementation

Source code on GitHub

We're going to use some advanced techniques:

  • Graph and subgraph architecture
  • Parallel node execution
  • Qdrant for storage
  • Observability via LangSmith
  • GPT-4o-mini, a cost-efficient LLM

Refer to the LangSmith trace for a detailed overview of the flow:
https://smith.langchain.com/public/b3a64491-b4e1-423d-9802-06fcf79339d2/r
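To make the graph-and-parallel-execution idea concrete, here is a minimal LangGraph.js sketch of the fan-out/fan-in flow: the two code-generation nodes run in parallel, and the sandbox node waits for both. Node names and state fields are illustrative, not the repo's exact ones.

import { StateGraph, Annotation, START, END } from "@langchain/langgraph";

// Illustrative state; the real graph state lives in the linked repo.
const GraphState = Annotation.Root({
  userInput: Annotation<string>(),
  pythonParametersExtractionMethod: Annotation<string>(),
  pythonValidationMethod: Annotation<string>(),
  validationErrors: Annotation<string[]>(),
});

// Stub nodes: each returns a partial state update.
const generateExtractionCode = async (_state: typeof GraphState.State) => ({
  pythonParametersExtractionMethod: "...", // LLM-generated Python goes here
});
const generateValidationCode = async (_state: typeof GraphState.State) => ({
  pythonValidationMethod: "...", // Python generated from the stored rules
});
const runSandbox = async (_state: typeof GraphState.State) => ({
  validationErrors: [] as string[], // filled from the Piston execution result
});

const app = new StateGraph(GraphState)
  .addNode("generateExtractionCode", generateExtractionCode)
  .addNode("generateValidationCode", generateValidationCode)
  .addNode("runSandbox", runSandbox)
  // Two edges from START make the generation nodes execute in parallel.
  .addEdge(START, "generateExtractionCode")
  .addEdge(START, "generateValidationCode")
  // The sandbox node runs once both branches have finished.
  .addEdge("generateExtractionCode", "runSandbox")
  .addEdge("generateValidationCode", "runSandbox")
  .addEdge("runSandbox", END)
  .compile();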

Step 1: Extract Datetime-Related Scheduling Parameters from User Input

Example: "Tomorrow, last Friday, in 2 hours, at noon."
We use the code interpreter to ensure reliable extraction, as LLMs can fail at this even when given the current date and time as context.

Example Prompt for Python Code Generation:

Your task is to transform natural language text into Python code that extracts datetime-related scheduling parameters from user input.  

## Instructions:  
- You are allowed to use only the "datetime" and "calendar" libraries.  
- You can define additional private helper methods to improve code readability and modularize validation logic.  
- Do not include any import statements in the output.  
- Assume all input timestamps are provided in the GMT+8 timezone. Adjust calculations accordingly.  
- The output should be a single method definition with the following characteristics:  
  - Method name: \`getCustomerSchedulingParameters\`  
  - Arguments: None  
  - Return: A JSON object with the keys:  
    - \`appointment_date\`: The day of the month (integer or \`None\`).  
    - \`appointment_month\`: The month of the year (integer or \`None\`).  
    - \`appointment_year\`: The year (integer or \`None\`).  
    - \`appointment_time_hour\`: The hour of the day in 24-hour format (integer or \`None\`).  
    - \`appointment_time_minute\`: The minute of the hour (integer or \`None\`).  
    - \`duration_hours\`: The duration of the appointment in hours (float or \`None\`).  
    - \`frequency\`: The recurrence of the appointment. Can be \`"Adhoc"\`, \`"Daily"\`, \`"Weekly"\`, or \`"Monthly"\` (string or \`None\`).  

- If a specific value is not found in the text, return \`None\` for that field.  
- Focus only on extracting values explicitly mentioned in the input text; do not make assumptions.  
- Do not include print statements or logging in the output.  

## Example:  

### Input:  
"I want to book an appointment for next Monday at 2pm for 2.5 hours."  

### Output:  
def getCustomerSchedulingParameters():  
    """Extracts and returns scheduling parameters from user input in GMT+8 timezone.  

    Returns:  
        A JSON object with the required scheduling parameters.  
    """  
    def _get_next_monday():  
        """Helper function to calculate the date of the next Monday."""  
        current_time = datetime.datetime.utcnow() + datetime.timedelta(hours=8)  # Adjust to GMT+8  
        today = current_time.date()  
        days_until_monday = (7 - today.weekday()) % 7 or 7  # Monday is 0; "next Monday" is never today  
        return today + datetime.timedelta(days=days_until_monday)  

    next_monday = _get_next_monday()  
    return {  
        "appointment_date": next_monday.day,  
        "appointment_month": next_monday.month,  
        "appointment_year": next_monday.year,  
        "appointment_time_hour": 14,  
        "appointment_time_minute": 0,  
        "duration_hours": 2.5,  
        "frequency": "Adhoc"  
    }

### Notes:
Ensure the output is plain Python code without any formatting or additional explanations.
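Generating this Python is a single model call. A sketch with LangChain.js and GPT-4o-mini (extractionPrompt and userMessage are illustrative placeholders):

import { ChatOpenAI } from "@langchain/openai";

const extractionPrompt = "..."; // the full prompt shown above
const userMessage = "I want to book an appointment for next Monday at 2pm for 2.5 hours.";

// temperature 0 keeps the generated code as deterministic as possible.
const llm = new ChatOpenAI({ model: "gpt-4o-mini", temperature: 0 });

const response = await llm.invoke([
  ["system", extractionPrompt],
  ["human", userMessage],
]);

// The prompt requests plain Python, so the content can be used directly.
const pythonParametersExtractionMethod = response.content as string;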

Step 2: Fetch Rules from Storage

The chatbot fetches the scheduling rules from storage and transforms them into Python code for validation; a retrieval sketch follows below.

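A minimal retrieval sketch, assuming the rules are stored as documents in a Qdrant collection (the collection name, URL, and query are illustrative):

import { QdrantVectorStore } from "@langchain/qdrant";
import { OpenAIEmbeddings } from "@langchain/openai";

const userMessage = "I want to book an appointment for next Monday at 2pm";

// Connect to an existing collection; "scheduling-rules" is a hypothetical name.
const vectorStore = await QdrantVectorStore.fromExistingCollection(
  new OpenAIEmbeddings(),
  { url: "http://localhost:6333", collectionName: "scheduling-rules" }
);

// Pull the rules most relevant to the customer's request, then pass them
// to the LLM prompt that generates the validation method.
const relevantRules = await vectorStore.similaritySearch(userMessage, 5);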

Step 3: Run Generated Code in the Sandbox

const pythonCodeToInvoke = `
import sys
import datetime
import calendar
import json

${state.pythonValidationMethod}

${state.pythonParametersExtractionMethod}

parameters = getCustomerSchedulingParameters()

validation_errors = validateCustomerSchedulingParameters(parameters["appointment_year"], parameters["appointment_month"], parameters["appointment_date"], parameters["appointment_time_hour"], parameters["appointment_time_minute"], parameters["duration_hours"], parameters["frequency"])

print(json.dumps({"validation_errors": validation_errors}))`;

// Wrap the sandbox call with LangSmith's traceable() for observability.
const traceableCodeInterpreterFunction = traceable((code: string) =>
  codeInterpreter.execute('python', code, { args: [] })
);
const result = await traceableCodeInterpreterFunction(pythonCodeToInvoke);
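The sandboxed script prints a single JSON line, so recovering the validation result is one parse away (this assumes the Piston v2 response shape, where stdout lives under result.run):

// Parse the JSON line printed by the sandboxed Python script.
const { validation_errors } = JSON.parse(result.run.stdout.trim());

if (validation_errors.length > 0) {
  // Hand the errors back to the chatbot so it can inform the customer.
  console.log("Validation failed:", validation_errors);
}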


Source code on GitHub


Potential Improvements

  • Implement an iterative loop so the LLM can debug and refine the generated Python dynamically (see the sketch below).
  • Add a human in the loop for validation-method code generation.
  • Cache generated code.
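A rough sketch of the first idea, a self-correction loop: while the sandbox reports a non-zero exit code, feed stderr back to the model and retry. The regenerateCode helper is hypothetical, standing in for another LLM call.

// Illustrative stub: in practice this would be an LLM call that receives
// the failing code plus stderr and returns a corrected version.
async function regenerateCode(code: string, _stderr: string): Promise<string> {
  return code;
}

let code = pythonCodeToInvoke;
let result = await codeInterpreter.execute('python', code, { args: [] });
let attempts = 0;

// Retry up to three times while the sandboxed run keeps failing.
while (result.run.code !== 0 && attempts < 3) {
  code = await regenerateCode(code, result.run.stderr);
  result = await codeInterpreter.execute('python', code, { args: [] });
  attempts++;
}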

Final Thoughts

Bytecode execution and token-based LLMs are highly complementary technologies, unlocking a new level of flexibility. This synergistic approach has a bright future: AWS's recent "Bedrock Automated Reasoning," for example, appears to offer a similar solution within its enterprise ecosystem, and Google and Microsoft will likely show us something similar very soon.
