
What does LLM Temperature Actually Mean?

Published at: 10/28/2024
Categories: ai, aiops, rag, opensource
Author: jackcolquitt

Up to this point, I thought I knew what temperature means for an LLM. A lower temperature increases determinism, reducing the likelihood of hallucinations or inaccurate responses. Google’s definition echoes this perception:

“The temperature controls the degree of randomness in token selection. The temperature is used for sampling during response generation, which occurs when topP and topK are applied. Lower temperatures are good for prompts that require a more deterministic or less open-ended response, while higher temperatures can lead to more diverse or creative results. A temperature of 0 is deterministic, meaning that the highest probability response is always selected.”
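To make that definition concrete: temperature is commonly described as a rescaling of the model’s token logits before sampling. Here’s a toy sketch with made-up logits, ignoring the topK/topP filtering Gemini also applies, and not Gemini’s actual implementation:

```python
# Toy illustration: temperature rescales logits before sampling.
# Lower T sharpens the distribution toward the top token; higher T flattens it.
import math

def softmax_with_temperature(logits, temperature):
    if temperature == 0:
        # Degenerate case: greedy selection of the highest-probability token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
for t in (0.0, 0.5, 1.0, 2.0):
    print(t, [round(p, 3) for p in softmax_with_temperature(logits, t)])
```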

That makes sense - but if temperature is so straightforward, why are my tests with Gemini-1.5-Flash-002 so nonsensical???

We’ve been looking into adding what we’re calling “document-level metadata” to the TrustGraph extraction process. While we did add this feature in release 0.13.2, I had been evaluating using an LLM to extract important entities and topics for an entire text corpus. I normally set the temperature to 0.0, since this should produce the most accurate extraction. I ran an extraction with Gemini-1.5-Flash-002. It looked good, except for one problem: I had accidentally set the temperature to 1.0. I reran it at 0.0, and the results were worse. What’s going on?

I’ve never run comparison tests with TrustGraph where I did nothing but vary the temperature, but I decided, why not? For a single document, I did 3 runs at each temperature setting: 0.0, 0.5, 1.0, 1.5, and 2.0. Yes, the temperature of Gemini goes to 2.0. No, I don’t know why. For the other parameters, I set top_p=1.0 and top_k=40, and maxed out the output tokens at 8192 for all runs. I also used a JSON schema object for the response type, roughly as in the sketch below.
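For reference, here’s roughly how that configuration maps onto the google-generativeai Python SDK. This is a minimal sketch under my own assumptions, not the actual TrustGraph extraction code; the file path and the response schema are placeholders.

```python
# Minimal sketch of the run configuration using the google-generativeai SDK.
# Not the actual TrustGraph code; the path and schema are placeholders.
import google.generativeai as genai

genai.configure(api_key="...")  # your API key

model = genai.GenerativeModel("gemini-1.5-flash-002")

document_text = open("challenger_report.txt").read()  # hypothetical path to the extracted PDF text

response = model.generate_content(
    document_text,
    generation_config=genai.GenerationConfig(
        temperature=0.0,            # varied per run: 0.0, 0.5, 1.0, 1.5, 2.0
        top_p=1.0,
        top_k=40,
        max_output_tokens=8192,
        response_mime_type="application/json",
        response_schema=list[str],  # simplified placeholder schema
    ),
)
print(response.text)  # JSON string, possibly truncated if the 8192-token cap is hit
```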

Given my understanding of temperature, I expected Gemini to extract more information, returning more objects as the temperature increased; I would think a more deterministic response would be more conservative in how much information it extracts. Except that didn’t happen. Then again, my hypothesis wasn’t exactly proven wrong either. In fact, I’m not sure what these results mean.

The first document I tested was the Rogers Commission Report on the NASA Challenger disaster. That PDF extracts to 176k tokens, 17.6% of Gemini-1.5-Flash’s advertised context window. For each run, here’s the number of output tokens:

[Table 1: output token counts per run at each temperature for the Rogers Commission Report]

The second document was another NASA report, on the decision making behind the Columbia disaster. That PDF extracts to 24.4k tokens, 2.4% of the advertised context window.

[Table 2: output token counts per run at each temperature for the Columbia report]

The inconsistency of the first set of test runs is inexplicable. Most of the time, Gemini tried to extract more than the maximum 8192 tokens, returning an incomplete and invalid JSON object. Yet what about run 2, where even at a temperature of 0.0 Gemini returned only 1511 tokens? Why did increasing the temperature to 2.0 decrease the output so dramatically? The data is so inconsistent that I don’t know where to begin to draw any conclusions.

The second document’s data is more consistent. For instance, at a temperature of 0.0, it returned the same number of tokens all 3 times. When increasing the temperature to 0.5, the responses did grow as I predicted. And then there’s temperature 1.0, where the response sizes go down. Beyond 1.0, the responses mostly shrink, with one outlier at 2.0 where the responses roughly doubled.

With this data, can I draw any meaningful conclusions? Yes, I think I can.

  • Long context windows still aren’t reliable. Even at only 17.6% of Gemini’s advertised context window, the behavior is shockingly inconsistent.
  • At a much smaller context, the temperature behavior seems to be more consistent, but still a bit mysterious.
  • For knowledge extraction tasks, temperature doesn’t work the way we think it should.

Sure, the consistency of those 3 runs that returned the same output every time seems great, but what if we want more? For knowledge extraction and graph building in TrustGraph, we’re trying to extract every important detail from the input document. We don’t want just facts, but any meaningful statements or opinions described in the text. It appears that allowing the LLM to introduce some randomness into the response tokens produces more objects for information extraction. Bizarrely, I also noticed that higher temperatures seemed to return more people than lower temperatures did. Based on cursory glances, none of the responses seemed to be producing hallucinations, but that observation will require more testing.
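For illustration, a response schema along these lines could be passed as response_schema so the model returns structured JSON. This is hypothetical and not TrustGraph’s actual extraction schema:

```python
# Hypothetical extraction schema for illustration only; TrustGraph's real
# schema may differ. The google-generativeai SDK accepts TypedDicts like
# this as a response_schema.
from typing_extensions import TypedDict

class Extraction(TypedDict):
    entities: list[str]    # people, organizations, and systems named in the text
    topics: list[str]      # high-level subjects the document covers
    statements: list[str]  # meaningful claims or opinions, not just bare facts

# e.g. generation_config=genai.GenerationConfig(..., response_schema=Extraction)
```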

Download TrustGraph on GitHub

🚀 Get Started

Join the community
