> In effect, every Lyft vehicle in operation today, with a smartphone on the dashboard, could be commandeered to become a “camera” watching, surveying and mapping the roads that those cars drive on, and how humans behave on them, using that to help Lyft’s autonomous vehicle (AV) platform learn more about driving overall.
Another instance of the "more data is better" fallacy. Humans can drive cars safely after only fifteen years of intermittent sensory input. Lyft could collect that "data" within just one year employing ten collectors. That still doesn't give you brains.
Is Techcrunch an outlet for PR pieces? I'm asking because that article reads like one of those.
Depends on whether it’s the right data, or meaningful data. The idea that Lyft needs this acquisition in order to implement video capture from its drivers’ cell phones is laughable, so something else is going on with this acquisition. But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.
> Depends on whether it’s the right data, or meaningful data.
Street level mapping data isn't relevant or meaningful? Basically every company working on this problem seems to pretty strongly disagree with you.
> But to your point, the assumption that this is even a problem of “not enough data” is questionable at this point. How to turn that data into results is something no one has come close to figuring out yet.
This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.
>> This is trivially false. Given infinite data, all possible situations would be represented in the data, and the solutions applied in those situations could be copied exactly, something that existing algorithms are completely capable of doing.
In principle. In practice, you'd need infinite time and infinite storage.
Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.
> In principle. In practice, you'd need infinite time and infinite storage.
That is irrelevant.
> Btw, do you have to add stuff like "This is trivially false" to your comments? It doesn't make your comments sound more right, only less well considered.
Trivial in the mathematical sense. As in, there is a trivial counter-example to your point. Citing infinity is a 'trivial' case. I'm using 'trivial' to describe my counter-example, not his error.
Given infinite data, infinite storage and infinite computing you would be right. In practice it means you are wrong. Feeding more data does not necessarily help given a finite amount of computing power.
More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.
There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.
Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".
> More data is necessary with current technology, in the sense that modern statistical machine learning algorithms are very bad at generalising to unseen data, and the only way to overcome this is to give them more examples.
Precisely.
> There are machine learning techniques that generalise well from few data, but they are not very well known in the industry.
Sure, and we'd all love to be using those. But even if you generalize well from small datasets, you still generalize better from larger ones.
> Also, though more speculatively, I think the idea of "lots of data" is attractive to marketing departments. There's something about algorithms that need huge amounts of data and compute, that only a select few companies can use. I guess it gives bragging rights, of a sort: "we got the biggest data around. Buy our stuff!".
It may be attractive to marketing departments, but it is also essential to data science projects like this.
My point is that they all generalize better from larger datasets. Size is relative and some techniques work better with more or less data. Linear regression, for instance, can work quite well with much less data than a neural net. It just depends on the complexity of the problem.
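As a rough illustration of the linear regression point, here is a minimal sketch in plain Python. The ground-truth line y = 2x + 1, the noise level, and the names `fit_line` and `make_data` are all assumptions made up for this example. When the model family matches the problem, a handful of points already pins the parameters down fairly well.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def make_data(n, rng, noise=0.1):
    """Hypothetical ground truth: y = 2x + 1 plus small Gaussian noise."""
    xs = [rng.uniform(0.0, 10.0) for _ in range(n)]
    ys = [2.0 * x + 1.0 + rng.gauss(0.0, noise) for x in xs]
    return xs, ys

rng = random.Random(0)  # fixed seed so the run is repeatable
small_fit = fit_line(*make_data(8, rng))
large_fit = fit_line(*make_data(8000, rng))
print("8 points:    slope=%.3f intercept=%.3f" % small_fit)
print("8000 points: slope=%.3f intercept=%.3f" % large_fit)
```

The larger sample should still give the tighter estimate, so this doesn't contradict the "more data helps" claim. It only shows how quickly the returns diminish when the model is simple relative to the problem.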
>> My point is that they all generalize better from larger datasets.
Like I say, this is not the case. There are learning algorithms that generalise so well from few data that their performance can improve only marginally with increasing amounts of data, or not at all.
I appreciate that you probably have no idea what I'm talking about. I certainly don't mean linear regression.
> Like I say, this is not the case. There are learning algorithms that generalise so well from few data that their performance can improve only marginally with increasing amounts of data, or not at all.
Erm, no. Not unless they are solving the problem perfectly.
> I appreciate that you probably have no idea what I'm talking about. I certainly don't mean linear regression.
I work in the field. I'm quite certain I'm familiar with whatever it is that you think you're talking about.
The category of algorithms that attempt to learn things from few examples is called 'one-shot learning'. It's usually in the context of image classification, but it applies equally well elsewhere. These algorithms still learn better from more data.
Do feel free to share an example of an algorithm that generalizes better from less data. I'll wait.
>> Erm, no. Not unless they are solving the problem perfectly.
Well, yes, that's what I mean.
I gave an example here a while ago, of how a Meta-Interpretive Learning algorithm, Metagol, can learn the aⁿbⁿ grammar perfectly from 4 positive examples:
That's typical of Metagol, as well as other algorithms in Inductive Logic Programming, the broader sub-field of machine learning that MIL belongs to.
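For anyone who hasn't seen the aⁿbⁿ language: it can be described by two recursive rules, so once a learner has found them there is nothing left for extra data to improve. The Python below is only a hand-written recogniser for that target. It is not Metagol, it does no learning, and the four example strings are an assumption about what such a training set might look like.

```python
def accepts(s):
    """Recognise a^n b^n (n >= 1) via two rules: S -> 'ab' and S -> 'a' S 'b'."""
    if s == "ab":
        return True
    if len(s) >= 4 and s[0] == "a" and s[-1] == "b":
        # Strip one 'a' from the front and one 'b' from the back, then recurse.
        return accepts(s[1:-1])
    return False

# Hypothetical positive examples, chosen for illustration only.
positive_examples = ["ab", "aabb", "aaabbb", "aaaabbbb"]
print(all(accepts(s) for s in positive_examples))

# The same two rules cover strings far longer than any of the examples.
print(accepts("a" * 50 + "b" * 50))
print(accepts("aabbb"))  # unbalanced, should be rejected
print(accepts("abab"))   # wrong shape, should be rejected
```

The point of the sketch is that the hypothesis is exact: a fifth or a five-thousandth positive example would not change the two rules.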
>> Do feel free to share an example of an algorithm that generalizes better from less data. I'll wait.
To clarify, my claim is that there are algorithms that learn adequately from few data and therefore don't "need" more data. Not that less data is better.
That said, there are theoretical results that suggest that a larger hypothesis space increases the chance of the learner overfitting to noise. So what is really needed in order to improve generalisation is not more data, but more relevant data. Then again, that is the subject of my current PhD so I might just be interpreting everything through the lens of my research (as is typical for PhD students).
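A toy sketch of the hypothesis-space intuition, under loud assumptions: the labels are pure coin flips, each "hypothesis" is just a fixed random labelling, and the names and sizes (`best_in_space`, `N_TRAIN`, and so on) are invented for illustration. It is not one of the theoretical results, only the idea in code: the more hypotheses the learner can choose from, the better the best one fits the noise in the training set, while held-out accuracy stays near chance.

```python
import random

rng = random.Random(42)  # fixed seed so the run is repeatable
N_TRAIN, N_TEST, MAX_H = 12, 500, 1024

# Labels are pure coin flips, so there is nothing real to learn.
train_labels = [rng.randint(0, 1) for _ in range(N_TRAIN)]
test_labels = [rng.randint(0, 1) for _ in range(N_TEST)]

# Each hypothetical "hypothesis" is a fixed random labelling of all points.
hypotheses = [
    [rng.randint(0, 1) for _ in range(N_TRAIN + N_TEST)] for _ in range(MAX_H)
]

def accuracy(pred, truth):
    """Fraction of positions where prediction and truth agree."""
    return sum(p == t for p, t in zip(pred, truth)) / len(truth)

def best_in_space(size):
    """Select by training accuracy among the first `size` hypotheses."""
    space = hypotheses[:size]
    best = max(space, key=lambda h: accuracy(h[:N_TRAIN], train_labels))
    train_acc = accuracy(best[:N_TRAIN], train_labels)
    test_acc = accuracy(best[N_TRAIN:], test_labels)
    return train_acc, test_acc

for size in (4, 64, 1024):
    train_acc, test_acc = best_in_space(size)
    print(f"|H|={size:5d}  train={train_acc:.2f}  test={test_acc:.2f}")
```

Because the spaces are nested, the best training accuracy can only go up as the space grows, even though none of these hypotheses knows anything about the labels.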
A small amount of the right data is better than lots of the wrong data. Collecting a lot of data just because it's easy to collect isn't very helpful if it turns out to be the wrong data.
It would likely be more informative to instrument a few cars with an advanced sensor package and let well-ranked drivers drive them around than to try to gather data from smartphones in existing cars, but I suppose it depends on what the end use is.
More data is not always better. It can be, for sure, but you need to have the analytical capabilities to turn it into useful information. Otherwise it's just hoarding.
How is having recorded video of actual roads not valuable data? You're assuming the only option is training self-driving cars on it. They might be training something totally separate to recognize signs, spot damaged roads, measure how quickly pedestrians react at different times of day, etc.
I agree that having detailed maps is an advantage. But they only make you a better driver in the places where they are correct. As such you can't rely on them to learn to drive, because you must be able to adapt to changes. Driving is not about the best case, but about the worst case.
Using uncalibrated and randomly positioned phone cameras to learn about driving a car that has better sensory equipment seems backwards to me. But point taken, the article says "learn more about driving overall." So that could be anything.
To reiterate, you cannot ever drive based on any map, regardless of how many smartphones collected that map's data. You might use a map for navigation, but even then you'll have to deal with closed roads or changed traffic flow that isn't yet on the map. You cannot use maps for driving.
Well, that's my sentiment too. But the article is wishy-washy on the use of the collected data. Having local knowledge can make you a better driver after all.