The More Data the Better, Right?

Why data moats in self-driving cars are not as powerful as you may think, and more

Feb 09, 2020

Data Moats

Datasets fuel the many machine learning models powering a self-driving car. Surely the larger the dataset, the more intelligent your machine learning model is, right? Wrong! Let’s talk about why it’s not as clear cut as it may appear.

Since we don’t drive millions of miles at Voyage, I’ve been asked many times what our data moat is. Last week, we shared our answer in an in-depth post on our work with Active Learning. One of our data moats lies in the intelligent, automated techniques we use to optimize our datasets, not in the datasets themselves. That post covers how we use Active Learning to achieve strong model performance with relatively small datasets.

Before I share more about Active Learning, I highly encourage every founder to read this post on data moats from Martin Casado and Peter Lauten at A16Z. A data moat predicated on the pure size of the dataset is not as strong as you may think.

Active Learning

Most machine learning models today are trained with manually curated datasets. As Mat recently tweeted, this curation is incredibly important.

Mat Leonard @MatDrinksTea

By far the hardest part of machine learning is the dataset. Collecting, labeling, and cleaning the data. Unless you're chasing state-of-the-art results, training a model is a few lines of code and a bit of time. In order of effort, IMO: 1. Data 2. Deploying 3. Model

Humans who clean and curate datasets treat the work as an art form, resulting in a process that is time-intensive and error-prone. Active Learning serves as a powerful alternative to manual curation, because it turns the curation of your dataset into a science, optimizing the size and diversity of the dataset for the best possible model results. For example, NVIDIA shared that by utilizing Active Learning they saw a 3x improvement in pedestrian detection and a 4.4x improvement in bicyclist detection compared with manual dataset curation. We are seeing Active Learning have a similar impact on the performance of our models.

Active Learning operates in iterations. The first iteration trains the model with a very small dataset. The results of this model are measured, and Active Learning then attempts to solve the model’s performance issues by adding just the right types and quantity of data to the dataset. For example, when we want to improve vehicle detection, it could be that adding more samples of vehicles actually begins to hurt the performance of that class. Instead, the model may crave negative samples (e.g. cats!) to improve its vehicle detection, so it can also learn what a vehicle doesn’t look like. Through many iterations, Active Learning will attempt to optimize the dataset in this fashion until it achieves the best possible model performance.
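To make the loop concrete, here is a minimal sketch in Python. It is not our production pipeline. It assumes uncertainty sampling (one common Active Learning selection strategy), a toy nearest-centroid classifier on 2D points, and a hypothetical `oracle` function standing in for a human labeler. Every name and number in it is made up for illustration.

```python
import random

def train(labeled):
    """Toy nearest-centroid model: one mean point per class."""
    sums = {}
    for (x, y), label in labeled:
        sx, sy, n = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, n + 1)
    return {label: (sx / n, sy / n) for label, (sx, sy, n) in sums.items()}

def distances(model, point):
    """Euclidean distance from a point to each class centroid."""
    x, y = point
    return {label: ((x - cx) ** 2 + (y - cy) ** 2) ** 0.5
            for label, (cx, cy) in model.items()}

def predict(model, point):
    d = distances(model, point)
    return min(d, key=d.get)

def margin(model, point):
    """Gap between the two nearest centroids; a small gap means the model is unsure."""
    nearest, runner_up = sorted(distances(model, point).values())[:2]
    return runner_up - nearest

def accuracy(model, dataset):
    return sum(predict(model, p) == label for p, label in dataset) / len(dataset)

def active_learning(seed, pool, oracle, validation, rounds=5, batch_size=4):
    """Train, measure, then label only the samples the model is least sure about."""
    labeled, pool, history = list(seed), list(pool), []
    for _ in range(rounds):
        model = train(labeled)
        history.append((len(labeled), accuracy(model, validation)))
        pool.sort(key=lambda p: margin(model, p))       # most uncertain first
        picked, pool = pool[:batch_size], pool[batch_size:]
        labeled.extend((p, oracle(p)) for p in picked)  # "ask the human"
    return labeled, history

# Hypothetical toy data: points above the line x + y = 1 count as vehicles.
def oracle(point):
    return "vehicle" if point[0] + point[1] > 1.0 else "background"

rng = random.Random(0)
pool = [(rng.random(), rng.random()) for _ in range(200)]
validation = [(p, oracle(p)) for p in
              ((rng.random(), rng.random()) for _ in range(100))]
seed = [((0.1, 0.1), "background"), ((0.9, 0.9), "vehicle")]

labeled, history = active_learning(seed, pool, oracle, validation)
for size, acc in history:
    print(f"{size:3d} labeled samples -> validation accuracy {acc:.2f}")
```

The point of the sketch is the shape of the loop: each round spends the labeling budget only on the samples the current model finds hardest, rather than on whatever happens to be next in the pool. A real system would swap in a real detector and a selection strategy suited to it.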

Active Learning also shines a light on how to improve your data collection process. It’s unlikely you already have the perfect dataset collected today, and Active Learning can inform which classes of object you need to gather more (or less) of going forward.

We are excited to share more about our progress with Active Learning, and I look forward to seeing it aid in continuous performance improvements in our machine learning models.

Content Search

Last week, Waymo also shared one of their data advantages: Google Search. In what should come as no big surprise, Waymo is now using technology similar to Google Image Search to query sensor data from their 20 million miles of driving. They intend to use this tool (named Content Search) to source unlabeled objects to improve the diversity of their datasets.

An example they cite is improving their detection of “oversize load” vehicles (which usually include a big sign at the front). Content Search uses OCR to index the text on that sign in images, meaning Waymo can now find all unlabeled instances of oversize load vehicles within their sensor data. Content Search, and the broader benefits of tapping into Google algorithms, could prove to be a powerful advantage over others trying to solve the most complex driving environments.
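As a rough illustration of the idea (and only that, since Waymo has not published Content Search's internals here), the text-lookup part can be sketched as an inverted index from OCR'd words to the frames they appear in. The sketch below assumes the OCR step has already run, and the `ocr_results` mapping, frame IDs, and function names are all hypothetical.

```python
from collections import defaultdict

def build_index(ocr_results):
    """Map each lowercased word to the set of frames whose OCR text contains it."""
    index = defaultdict(set)
    for frame_id, text in ocr_results.items():
        for token in text.lower().split():
            index[token].add(frame_id)
    return index

def search(index, query):
    """Return the frames whose OCR text contains every word in the query."""
    tokens = query.lower().split()
    if not tokens:
        return set()
    hits = [index.get(token, set()) for token in tokens]
    return set.intersection(*hits)

# Hypothetical OCR output: frame ID -> text read from signs in that frame.
ocr_results = {
    "frame_0001": "OVERSIZE LOAD",
    "frame_0002": "STOP",
    "frame_0003": "Oversize Load ahead",
    "frame_0004": "LOAD ZONE",
}

index = build_index(ocr_results)
print(sorted(search(index, "oversize load")))  # frames 0001 and 0003
```

Frames returned this way would still be unlabeled. The lookup only narrows down where to send labeling effort.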

Everything Else

Nuro introduced R2, the first autonomous vehicle to receive a USDOT-approved exemption (in exchange for operational data)
Lyft shared a video of their self-driving car in action, including a lot of visualization data
Just how driverless are autonomous shuttles in the real world? This post shares disengagement insights from May Mobility’s deployment in Grand Rapids
The Information shared a breakdown of the development spend of autonomous vehicle programs. Waymo, Cruise, and Uber ATG were responsible for half of the $16bn spend

That’s all for this week. Please send any feedback or thoughts via Twitter, and be sure to tell any friends hungry for self-driving car knowledge to subscribe.