Testing LLM Speed Across Cloud Providers: Groq, Cerebras, AWS & More
After my previous exploration of local vs cloud GPU performance for LLMs, I wanted to dive deeper into comparing inference speeds across different cloud API providers. With all the buzz around Groq and Cerebras's blazing-fast inference claims, I was curious to see how they stack up in real-world usage.
The Testing Framework
I developed a simple Node.js-based framework to benchmark different LLM providers consistently. The framework:
- Runs a series of standardised prompts across different providers
- Measures inference time and captures the generated response
- Writes results to structured output files
- Supports multiple providers including OpenAI, Anthropic, AWS Bedrock, Groq, and Cerebras
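To make the idea concrete, here is a minimal sketch of how a harness like this could time requests. It is not the code from the repository: the `Provider` interface, the `BenchmarkResult` shape, and the function names are all hypothetical, and each real provider would need its own adapter around the vendor SDK.

```typescript
// Hypothetical adapter interface: one implementation per provider SDK.
interface Provider {
  name: string;
  complete(prompt: string): Promise<string>;
}

// Hypothetical result record, one per (provider, prompt) pair.
interface BenchmarkResult {
  provider: string;
  promptId: string;
  elapsedMs: number;
  outputChars: number;
}

// Times a single request from just before the call until the full response arrives.
async function timeCompletion(
  provider: Provider,
  promptId: string,
  prompt: string,
): Promise<BenchmarkResult> {
  const start = performance.now();
  const output = await provider.complete(prompt);
  const elapsedMs = performance.now() - start;
  return { provider: provider.name, promptId, elapsedMs, outputChars: output.length };
}

// Runs every prompt against every provider, one request at a time,
// so that concurrent calls do not distort each other's timings.
async function runBenchmark(
  providers: Provider[],
  prompts: Record<string, string>,
): Promise<BenchmarkResult[]> {
  const results: BenchmarkResult[] = [];
  for (const provider of providers) {
    for (const [promptId, prompt] of Object.entries(prompts)) {
      results.push(await timeCompletion(provider, promptId, prompt));
    }
  }
  return results;
}
```

The returned array can be serialised with `JSON.stringify` and written to disk to get a structured output file. Note that this measures end-to-end wall-clock time, so network latency is included alongside the provider's actual inference time.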
The test prompts were designed to cover different scenarios:
- Mathematical computations (typically challenging for LLMs)
- Long-form text summarisation (high input tokens, lower output)
- Structured output generation (JSON, XML, CSV formats)
Test Results
The complete benchmark results are available in this spreadsheet. While the GitHub repository contains the output from each LLM, we'll focus purely on performance metrics here.
One of the most interesting findings was the significant speed variation for identical models across different providers. This suggests that infrastructure and optimisation play a crucial role in inference speed.
The most dramatic differences emerged when testing larger models like Llama 70B. Providers optimised for fast inference showed remarkable capabilities, demonstrating that even models with 70B parameters can achieve impressive speeds with the right infrastructure.
Groq's performance across different model sizes reveals an intriguing pattern: whether running small or large models, inference speeds remain remarkably consistent, which suggests its infrastructure may be particularly well optimised for larger models.
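Comparing the same model across providers is easier with a single summary number per provider. The sketch below assumes each run was recorded as a provider, model, and elapsed time, and reduces the runs for one model to a median latency per provider. The `Timing` shape and the function names are hypothetical, and using the median (rather than the mean) is an assumption made here to dampen the effect of occasional slow requests, not something taken from the benchmark itself.

```typescript
// Hypothetical record for one timed request.
interface Timing {
  provider: string;
  model: string;
  elapsedMs: number;
}

// Median of a non-empty list of numbers.
function median(values: number[]): number {
  const sorted = values.slice().sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 0 ? (sorted[mid - 1] + sorted[mid]) / 2 : sorted[mid];
}

// Groups the timings for one model by provider and returns the median latency of each group.
function medianLatencyByProvider(timings: Timing[], model: string): Map<string, number> {
  const grouped = new Map<string, number[]>();
  for (const t of timings) {
    if (t.model !== model) continue;
    const bucket = grouped.get(t.provider) ?? [];
    bucket.push(t.elapsedMs);
    grouped.set(t.provider, bucket);
  }
  const result = new Map<string, number>();
  grouped.forEach((values, provider) => result.set(provider, median(values)));
  return result;
}
```

Calling `medianLatencyByProvider(timings, model)` once per model gives a small table of providers against median latency, which makes gaps between providers for the same model easy to spot.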
Key Findings
- Groq and Cerebras: The hype is real. Both providers demonstrated exceptional performance, particularly with larger models like Llama 3 70B
- Ollama: With a decent GPU (e.g., RTX 4090), smaller models (Llama 3.2 1B/3B) ran at speeds comparable to the quickest API-based models, such as Anthropic's Claude 3 Haiku and Amazon's Nova Micro
- Speed rankings were fairly consistent across different prompts (math, summarisation, structured output)
- API throttling became an issue with larger models on AWS Bedrock (Claude 3.5 Sonnet, Claude 3 Opus, Nova Pro)
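One common way to cope with throttling in a benchmark run is to retry with exponential backoff. The sketch below is a generic version of that idea, not how the original framework handled it. The `withBackoff` helper, its default values, and the `isThrottled` callback are all hypothetical. Since error shapes differ between SDKs, the caller decides what counts as a throttling error (for example, an HTTP 429 response).

```typescript
// Retries `call` when it fails with a throttling error, doubling the wait each time.
// Any other error, or a throttling error on the final attempt, is rethrown.
async function withBackoff<T>(
  call: () => Promise<T>,
  isThrottled: (error: unknown) => boolean,
  maxAttempts = 5,
  baseDelayMs = 500,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await call();
    } catch (error) {
      if (attempt >= maxAttempts || !isThrottled(error)) throw error;
      // Waits baseDelayMs, then 2x, 4x, ... before the next attempt.
      const delayMs = baseDelayMs * 2 ** (attempt - 1);
      await new Promise<void>((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```

For benchmarking, the timer should start inside `call`, so that only the successful request is measured and the backoff waits do not inflate the reported inference time.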