How to Make the Best Self-Driving Dataset

Published: 1/14/2025
Categories: computervision, machinelearning, ai, datascience
Author: Jimmy Guerrero

Author: Daniel Gural (Machine Learning Evangelist at Voxel51)

Hey there! Dan Gural here from Voxel51. I recently gave an in-person workshop on How to Unlock More in Self-Driving Datasets, and Iā€™m super excited to dive into it with you today and bring it to everyone! I highly recommend checking out my video, where I go into detail about the workshop, and the GitHub repo where all the materials are stored.

Let's Recap the Workshop!

Here is a quick highlight...

šŸš— The Race for Autonomy

The self-driving revolution is in full swing, and the question remains: Who will break the threshold to release self-driving cars into the world safely? A combination of hardware, software, and strategy is shaping this landscape.

šŸ’Ŗ The Power of GPUs

Nvidia CEO revealing the RTX 5090 at CES

First up: GPUs. Itā€™s no surprise to anyone following the field that powerful GPUs are driving innovation. With better hardware, we can train faster and run more complex algorithms directly on cars.

Iā€™ve seen this firsthandā€”some self-driving teams are removing spare tires to install racks of GPUs inside their vehicles! These advancements not only allow rapid model training but also support scaling. With GPUs constantly running training scenarios, weā€™re racing to build the best models faster than ever.

šŸ› ļø Improved Libraries

Hardware isnā€™t the only area seeing leaps forward. Open-source libraries have improved dramatically too. A few years ago, setting up tools like PyTorch3D was a nightmare. Today, it takes hours, not days, thanks to optimization and community support. These advancements have streamlined workflows and enabled faster experimentation.

šŸŒ Data at Scale

Self-driving datasets are unmatched in size and complexity. Weā€™re talking about petabytes of data: multi-camera setups, LiDAR, radar, and richly annotated maps of entire cities. Companies like Waymo and Tesla are collecting countless hours of sensor data, creating unparalleled datasets.

But the challenge lies in organizing and managing this unstructured data. Thatā€™s where tools like FiftyOne, Voxel51ā€™s open-source dataset management tool, come in.

šŸ§  Top-Tier Talent Behind the Scenes

The self-driving field attracts the best engineers and researchers. Companies like Waymo, Wayve, and Tesla are leading the charge, but much of their work remains behind closed doors for competitive reasons. Recently, however, weā€™ve seen a shift, with more research being published. This openness is giving us a peek into what makes their labs so innovative.

šŸŽ² The Big Gamble

The strategies for self-driving success are as varied as the companies pursuing them:

  • Wayve: Aims for an end-to-end solution, teaching cars to drive anywhere by building world models.
  • Waymo: Focuses on mastering individual cities with detailed maps, making their vehicles highly efficient in those areas.
  • Tesla: Stands apart by relying solely on image-based systems, forgoing LiDAR entirelyā€”a bold but controversial approach.

Who will win? Itā€™s anyoneā€™s guess. The competition is a massive gamble, and itā€™s thrilling to watch.

šŸ› ļø Beginner Techniques: Curation, Digitization, and Dataset Management

Letā€™s get practical. Whether youā€™re a beginner or an expert, organizing self-driving data is the foundation. Two major challenges youā€™ll face:

  • Unstructured Data: Multi-camera systems, different sensors, varying frame ratesā€”itā€™s all a big jumble.
  • Scale: Even hobbyists deal with massive datasets, so efficient organization is key.

This is where FiftyOne shines (see the quick sketch after this list). With FiftyOne, you can:

  • Load diverse data (images, videos, radar, LiDAR) seamlessly.
  • Visualize, clean, and curate datasets.
  • Debug datasets, find gaps, and evaluate model performance.
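
To make this concrete, here's a minimal sketch of loading a directory of driving images into FiftyOne and opening the App. The ./driving_images path and the dataset name are placeholders for your own data.

```python
import fiftyone as fo

# Load a directory of images as a FiftyOne dataset (path is a placeholder)
dataset = fo.Dataset.from_dir(
    dataset_dir="./driving_images",
    dataset_type=fo.types.ImageDirectory,
    name="self-driving-demo",
)

# Launch the App to visualize, tag, and curate the samples
session = fo.launch_app(dataset)
```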

šŸ”§ Building a Grouped Dataset

Self-driving datasets often involve grouped samplesā€”for example, combining frames from multiple cameras and LiDAR scans taken at the same timestamp. FiftyOne makes it easy to:

  • Group data by timestamp.
  • Load annotations for every sensor simultaneously.
  • Detect misclassifications, poor-quality data, or model gaps.

All using FiftyOne Grouped Datasets. The result? A well-organized dataset ready for training.
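
Here's a minimal sketch of building such a grouped dataset, with one front-camera image and one LiDAR point cloud per timestamp. The slice names and file paths are illustrative, and note that newer FiftyOne versions wrap point clouds in .fo3d scenes rather than ingesting .pcd files directly.

```python
import fiftyone as fo

dataset = fo.Dataset("grouped-driving-demo")

# Declare a group field; "CAM_FRONT" is the default slice shown in the App
dataset.add_group_field("group", default="CAM_FRONT")

# One group per timestamp, with one sample per sensor slice (paths are placeholders)
samples = []
for timestamp in ["0001", "0002"]:
    group = fo.Group()
    samples.append(
        fo.Sample(
            filepath=f"./frames/{timestamp}/cam_front.jpg",
            group=group.element("CAM_FRONT"),
        )
    )
    samples.append(
        fo.Sample(
            filepath=f"./frames/{timestamp}/lidar_top.pcd",
            group=group.element("LIDAR_TOP"),
        )
    )

dataset.add_samples(samples)
print(dataset.group_slices)  # ['CAM_FRONT', 'LIDAR_TOP']
```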

šŸ“‚ FiftyOne: Your Dataset Debugger

If you havenā€™t tried FiftyOne yet, hereā€™s what youā€™re missing. With it, you can:

  • Visualize and explore large-scale data effortlessly.
  • Run embeddings to discover hidden patterns.
  • Evaluate your models with a transparent, open-source tool.

šŸ’” Invest in Data, Not Just Models

Cutting-edge models come and go, but high-quality data is what truly drives performance. By focusing on dataset management and curation, you can push your models to production faster and with better results.

Leveling Up ā¬†ļø Your Data Exploration with Embeddings and Pretrained Models

Now that we've gathered all that data and organized it, let's take things to the next level. Weā€™ve talked about the basic metadata, but how can we push the boundaries of whatā€™s hidden within that dataset? Thatā€™s where embeddings and pretrained models come into play.

These techniques allow us to dig deeper into our data and get more out of it than just surface-level metadata. Letā€™s explore how embeddings and pretrained models help us.

šŸ‘Æ Pretrained Models: Your New Best Friend

When we refer to pretrained models, we're talking about those powerful "zero-shot" models that have been trained on massive amounts of data and can recognize real-world objects without the need for human annotations. Imagine this: a model that can immediately identify common objects like pedestrians, traffic signs, or cars within your dataset. You donā€™t have to label everything manually ā€“ the model does the heavy lifting for you, with a reasonable level of confidence. Metaā€™s SAM2 is a great tool here, and it, along with many more models, is included in the FiftyOne Model Zoo!

However, the more obscure the object you're looking for, the less likely the model is to identify it accurately. But for common things like traffic signs or pedestrians, this is a quick and effective way to enrich your dataset without annotating each sample by hand.
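
As a hedged sketch, zero-shot enrichment with the Model Zoo might look like the following; the zoo model name and class list here are assumptions, so check foz.list_zoo_models() for what your FiftyOne version ships.

```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset = fo.load_dataset("self-driving-demo")  # the dataset from earlier

# Load a zero-shot detector from the Model Zoo (name may vary by version;
# run foz.list_zoo_models() to confirm)
model = foz.load_zoo_model(
    "zero-shot-detection-transformer-torch",
    classes=["pedestrian", "traffic sign", "car"],
)

# Generate detections without any human annotations
dataset.apply_model(model, label_field="zero_shot_predictions")
```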

But it doesn't stop there. Another invaluable family of pretrained models handles depth estimation, which helps you understand how far away objects are in a scene. This can be especially useful for scenarios like distinguishing between crowded city streets and more open highway environments.

šŸ” Embeddings: Finding Hidden Patterns in Your Data

While pretrained models help with recognizing objects, embeddings help us understand how similar or dissimilar various samples in our dataset are to one another. Using both 2D and 3D embedding models, we can create a map of our dataset, highlighting clusters of similar samples and pinpointing where there may be gaps. This is easy with the FiftyOne Brain.

The power of embeddings is that they help us identify where our dataset may be lacking. For example, if weā€™re building a self-driving car dataset, itā€™s essential to know what real-world scenarios our car may encounter and make sure weā€™ve got enough diversity in our dataset to handle all those situations.

In our example, Iā€™ll flatten my dataset and focus only on the images for now. This allows me to run embeddings using something like the CLIP model, which computes similarity between images. After generating the embeddings, we can visualize them in the FiftyOne App to explore our dataset from a new perspective.
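
Here's a minimal sketch of that flow, assuming the grouped dataset built earlier: select_group_slices() flattens it to just the image slices, and the FiftyOne Brain handles the embeddings and dimensionality reduction.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("grouped-driving-demo")

# Flatten the grouped dataset down to just its image slices
image_view = dataset.select_group_slices(media_type="image")

# Compute CLIP embeddings and a 2D visualization of them in one call
fob.compute_visualization(
    image_view,
    model="clip-vit-base32-torch",  # CLIP from the FiftyOne Model Zoo
    brain_key="img_viz",
)

# Explore the results in the App's Embeddings panel
session = fo.launch_app(image_view)
```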

šŸ‘ļø Visualizing Embeddings in the FiftyOne App

Once weā€™ve generated our embeddings, we can head to the embeddings panel in FiftyOne to view the results. Itā€™s as simple as hitting the plus button, selecting the "brain key," and using the embeddings visualization. You can also color-code the points based on different metadata stored in the dataset, like scene tokens.

As we dive into the embeddings grid, we get to see the relationships between different samples. You might notice, for example, that one cluster is all related to nighttime scenes, while another is more typical of daytime driving scenarios. These groupings help us understand whatā€™s going on in the data and spot potential outliers that might be worth investigating.

šŸ•µļø The Power of Embeddings for Dataset Curation

Whatā€™s really powerful about embeddings is that they help solve one of the most critical challenges in working with large-scale datasets: finding the unique, rare, and outlier samples.

If we think about the sheer volume of data collected from self-driving cars, annotating everything is simply not feasible (not to mention expensive). However, the key isnā€™t annotating everything ā€“ itā€™s about finding the data points weā€™ve never seen before and labeling those. Embeddings help us identify areas of the dataset that are underrepresented or have unique scenarios that our car may encounter.

This is where similarity search comes in handy. With a few clicks, you can find the most similar samples to any image in your dataset. For example, if you want to see all the traffic signs, just search for them and the system will return the closest matches. This helps us refine our data, ensuring that our model is well-trained on the things that matter most.
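
In FiftyOne, that looks roughly like the sketch below. The brain key is a placeholder, and text queries require an index built on a model that supports prompts, like CLIP.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("self-driving-demo")

# Build a similarity index over the images (CLIP also enables text queries)
fob.compute_similarity(
    dataset,
    model="clip-vit-base32-torch",
    brain_key="img_sim",
)

# Find the 25 samples most similar to a query image...
query_id = dataset.first().id
view = dataset.sort_by_similarity(query_id, k=25, brain_key="img_sim")

# ...or query by text, e.g. to surface traffic signs
view = dataset.sort_by_similarity("traffic sign", k=25, brain_key="img_sim")
```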

šŸ’Ŗ Real-World Applications: From QA to Future Model Training

But the value of embeddings doesnā€™t stop with data exploration. By leveraging embeddings, we can also tackle problems like finding labeling mistakes, surfacing unique samples, and identifying the hardest samples ā€“ all of which play a significant role in improving dataset quality and ensuring that model training is efficient.
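
Each of these maps to a FiftyOne Brain method. Here's a sketch, assuming a dataset with a predictions field (whose labels store logits, which hardness requires) and a ground_truth label field; both field names are assumptions.

```python
import fiftyone as fo
import fiftyone.brain as fob

dataset = fo.load_dataset("self-driving-demo")

# Uniqueness: scores how distinctive each sample is (great annotation candidates)
fob.compute_uniqueness(dataset)

# Hardness: ranks samples by how difficult they are for a model
# (assumes a "predictions" field whose labels store logits)
fob.compute_hardness(dataset, "predictions")

# Mistakenness: flags likely annotation errors by comparing predictions
# to ground truth (field names here are assumptions)
fob.compute_mistakenness(dataset, "predictions", label_field="ground_truth")

# Review the most suspect annotations first
view = dataset.sort_by("mistakenness", reverse=True)
```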

And speaking of improving model performance, letā€™s talk about how we can push the limits with the latest pretrained models, like SAM2 and Depth Anything.

šŸƒUnlocking Insights with SAM2 and Depth Anything

SAM2 is one of the latest segmentation models thatā€™s making waves, particularly in its ability to segment objects like cars, roads, and even the sky. By running SAM2 on your dataset, you can instantly segment out different parts of the scene, which helps you understand how the car perceives its environment. With depth estimation, you get a better sense of how far objects are from the car, giving you even more insight into the spatial layout of your scenes.

These models are powerful tools for adding layers of insight to your data. For example, using SAM2, we can quickly identify cars, pedestrians, and drivable areas. Meanwhile, depth estimation tells us how close or far away objects are, which is crucial for accurate decision-making in self-driving systems.
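
As a sketch of how this might look with the FiftyOne Model Zoo: the zoo model names and the Depth Anything checkpoint below are assumptions that vary by FiftyOne version, so check foz.list_zoo_models() before running.

```python
import fiftyone as fo
import fiftyone.zoo as foz

dataset = fo.load_dataset("self-driving-demo")

# Segment the scene with SAM2 (zoo model names vary by FiftyOne version;
# run foz.list_zoo_models() to confirm what's available)
sam2 = foz.load_zoo_model("segment-anything-2-hiera-tiny-image-torch")
dataset.apply_model(sam2, label_field="sam2_masks")

# Monocular depth estimation, e.g. via a Depth Anything checkpoint
# (the name_or_path value is an assumption; other HF depth models work too)
depth_model = foz.load_zoo_model(
    "depth-estimation-transformer-torch",
    name_or_path="depth-anything/Depth-Anything-V2-Small-hf",
)
dataset.apply_model(depth_model, label_field="depth")
```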

šŸ¤Æ Going Beyond: Expert Techniques for State-of-the-Art Self-Driving

So far, weā€™ve covered some of the most advanced techniques available to enhance your self-driving datasets. But the experts in the field are pushing things even further.

One of the biggest hurdles in self-driving technology is time. Training models takes time, and testing those models in real-world scenarios requires a lot of trial and error. What if you could eliminate this cycle and simulate scenarios in a controlled environment? Thatā€™s where simulation comes in.

With tools like DriveStudio and Gaussian Splats, researchers are building 3D environments where they can simulate real-world driving conditions without actually being on the road. This opens up a whole new world of possibilities for testing, validating, and improving self-driving models in a fraction of the time.

šŸ”— Get Started Today

Letā€™s take your self-driving projects to the next level! šŸš€

Want to try these techniques yourself? Head to my GitHub for code snippets and examples to help you build your first grouped dataset.
