dev-resources.site

Elon Musk agrees that we’ve exhausted AI training data

Published at
1/9/2025
Categories
ai
news
data
discuss
Author
aniruddhaadak
Elon Musk concurs with other AI experts that there’s little real-world data left to train AI models on.

“We’ve now exhausted basically the cumulative sum of human knowledge … in AI training,” Musk said during a conversation with Stagwell chairman Mark Penn that was livestreamed on X late Wednesday. “That happened basically last year.”

Musk, who owns AI company xAI, echoed themes former OpenAI chief scientist Ilya Sutskever touched on at NeurIPS, the machine learning conference, during an address in December. Sutskever, who said the AI industry had reached what he called “peak data,” predicted a lack of training data will force a shift away from the way models are developed today.

Indeed, Musk suggested that synthetic data — data generated by AI models themselves — is the path forward. “The only way to supplement [real-world data] is with synthetic data, where the AI creates [training data],” he said. “With synthetic data … [AI] will sort of grade itself and go through this process of self-learning.”

Other companies, including tech giants like Microsoft, Meta, OpenAI, and Anthropic, are already using synthetic data to train flagship AI models. Gartner estimates that 60% of the data used for AI and analytics projects in 2024 was synthetically generated.

Microsoft’s Phi-4, which was open-sourced early Wednesday, was trained on synthetic data alongside real-world data. So were Google’s Gemma models. Anthropic used some synthetic data to develop one of its most performant systems, Claude 3.5 Sonnet. And Meta fine-tuned its most recent Llama series of models using AI-generated data.

Training on synthetic data has other advantages, like cost savings. AI startup Writer claims its Palmyra X 004 model, which was developed almost entirely using synthetic sources, cost just $700,000 to develop, compared to estimates of $4.6 million for a comparably sized OpenAI model.

But there are disadvantages as well. Some research suggests that synthetic data can lead to model collapse, where a model becomes less “creative” and more biased in its outputs, eventually seriously compromising its functionality. Because models create synthetic data, if the data used to train these models has biases and limitations, their outputs will be similarly tainted.
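
The model collapse idea can be illustrated with a toy simulation. The sketch below is only an analogy and not how any of the models named above are trained: the "model" is assumed to be a simple Gaussian, and each generation is fitted solely to a small sample drawn from the previous generation's fit. The function name `simulate_collapse` and its parameters are hypothetical, chosen for illustration.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Toy analogy for model collapse (an assumption-laden sketch).

    Each generation fits a Gaussian (mean and spread) to a small
    sample drawn from the previous generation's fitted Gaussian,
    so every "model" after the first sees only synthetic data.
    Returns the fitted spread after each generation.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0 stands in for "real-world" data
    history = [sigma]
    for _ in range(generations):
        # Generate synthetic data from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Retrain" the next model on that synthetic data only.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: spread = {history[gen]:.6f}")
```

In this toy setting the fitted spread tends to shrink from generation to generation, because each small sample underrepresents the tails of the distribution it came from. That loosely mirrors the loss of diversity described above. Real systems are far more complex, and mixing in real-world data changes the picture.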
