
Lyft releases self-driving research dataset - dchengy
https://medium.com/lyftlevel5/unlocking-access-to-self-driving-research-the-lyft-level-5-dataset-and-competition-d487c27b1b6c
======
jedberg
> Academic research accelerates innovation, but it requires costly data that
> is out of reach for most academic teams.

This is true of pretty much any AI research. Look at Puffer[0], which was just
on HN a couple of days ago. They're running a free streaming service just to
get enough data to train their algorithms, and in fact mention in their FAQ
that they would love to use commercial data if they could get it.

Unfortunately, academic and commercial incentives don't really align here.
Most commercial entities don't want to share their data because it's valuable
to them, and if they let researchers in, they want the output of the research
to remain proprietary to their commercial enterprise.

I wonder if there isn't some sort of governance solution to this. Like give
companies big tax breaks for sharing their data with researchers, or something
like that. Essentially subsidize academia indirectly.

[0] [https://puffer.stanford.edu/player/](https://puffer.stanford.edu/player/)

~~~
ISL
I've seen semiconductor industry companies collaborate on grant-funding
fundamental condensed-matter physics research. If it is a question of interest
to all parties, and the work is too blue-sky to be immediately profitable,
sometimes they'll fund the work.

------
choppaface
Just some context here:

* The raw data in nuscenes ( [https://www.nuscenes.org/](https://www.nuscenes.org/) ) is about 5x larger than this dataset from Lyft: 300GB train vs. 60GB train. Argoverse ( [https://www.argoverse.org/](https://www.argoverse.org/) ) is about 3x larger at 200GB. The Waymo dataset will (allegedly) be an order of magnitude larger than nuscenes ( [https://i2.wp.com/syncedreview.com/wp-content/uploads/2019/0...](https://i2.wp.com/syncedreview.com/wp-content/uploads/2019/06/image-55.png?resize=1024%2C768&ssl=1) ). BDD100k ( [https://bair.berkeley.edu/blog/2018/05/30/bdd/](https://bair.berkeley.edu/blog/2018/05/30/bdd/) ) is the "largest" public dataset to date, but it lacks lidar and its labels are inconsistent; most of the 100,000 scenes have only one labeled frame.

* The Lyft sensor suite has bumper-mounted lidar, which is absent from other existing datasets. Point cloud data in these areas is critical for pedestrians, bikes, and various road hazards. So this dataset alone is useful for validating work trained through other means.

* The current Lyft Level 5 release has no explicit test / validation set, which is crucial for properly measuring the performance of any experiment one might do with the data. nuscenes and Argoverse each include a small snippet dataset that helps you prepare your pipeline. Feels like Lyft might have rushed things a little here: they could have posted a "teaser" and then the full train and test/validation sets a couple of weeks later.

Great to see more public data (especially from a more modern sensor suite),
plus investment into a contest with prizes.
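Absent an official split, anyone benchmarking on this release has to carve out a held-out set themselves. A minimal sketch of one way to do that in Python (the scene IDs are made up for illustration; splitting by scene rather than by frame matters because consecutive frames from the same drive are near-duplicates):

```python
import random

def split_scenes(scene_tokens, val_fraction=0.15, seed=0):
    """Deterministically split scene identifiers into train/val sets.

    Splitting at the scene level avoids leaking near-identical frames
    from the same drive into both sets.
    """
    rng = random.Random(seed)
    tokens = sorted(scene_tokens)   # sort first so the split is reproducible
    rng.shuffle(tokens)
    n_val = max(1, int(len(tokens) * val_fraction))
    val = set(tokens[:n_val])
    train = [t for t in tokens if t not in val]
    return train, sorted(val)

# Hypothetical scene IDs standing in for the dataset's real scene tokens:
train, val = split_scenes([f"scene-{i:03d}" for i in range(100)])
print(len(train), len(val))  # 85 train scenes, 15 val scenes
```

The downside, of course, is that everyone ends up with a different split, which is exactly why a common test set from Lyft would matter for comparing results.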

------
cardigan
(I work at scale)

Hmm, this blog post and the website don't mention that this dataset was
mostly annotated by Scale (scale.ai) as part of a partnership with Lyft...
We're going to publish a blog post about this soon, but if anyone at Lyft is
reading this, please figure out how to reasonably credit Scale, since I doubt
leaving Scale out of the announcement completely is in the spirit of the
agreement. Scale should probably also be added to the bibliography and website
in some form.

Contrast this with the nuScenes website, which was also annotated by Scale,
and whose data format set the standard for this dataset: they credit Scale
pretty reasonably

~~~
ayw
Hi, I'm the CEO of Scale.ai.

This comment does not represent the company's viewpoint, and cardigan is not
speaking on behalf of Scale.

We are very excited to have been able to work with Lyft in open-sourcing this
dataset and advancing the research community. We are also very grateful to
Lyft for choosing to leverage our point cloud viewer and have credited the
annotations to us on their launch page.

~~~
atarian
This is pretty effective marketing... generate some fake controversy over some
small slight and have it go viral on HN.

~~~
thetrainfold
(I work in marketing at scale)

Thank you.

(I don't really)

------
azinman2
$25k in prizes seems silly given this is a multi-billion dollar market to
crack.

~~~
arathore
Such competitions do not usually result in a comprehensive "solution" by
themselves; pushing the state of the art is more common. Also, the value is
not going to be derived solely from the algorithm, but more from its
deployment to real-world applications and the surrounding infrastructure that
makes it possible.

~~~
mkagenius
> pushing the state-of-the-art is more common

But do not forget that there will be tens (if not hundreds) of people working
on this for 30 days. The man-hours this competition will consume are highly
disproportionate to the total amount offered.

~~~
tal8d
That is the point:
[https://en.wikipedia.org/wiki/Orteig_Prize](https://en.wikipedia.org/wiki/Orteig_Prize)

------
jacobn
“There will be $25,000 in prizes, and we’ll be flying the top researchers to
the NeurIPS Conference in December, as well as allowing the winners to
interview with our team.”

I guess it’s a decent opportunity if you’re trying to break into DL?

~~~
bearpelican
I had a bad experience with Lyft's previous self-driving challenge:
[https://www.udacity.com/lyft-challenge](https://www.udacity.com/lyft-challenge)

I unofficially got first place after finding a bug in their test set (allowing
me to blow the competition away). I reported the problem directly - they
decided not to fix it and asked me to take my submission down. They said
they'd still offer an interview.

However - this interview wasn't even for their DL team. They offered an
interview with the web tools support team because they felt I didn't have
enough experience...

Reference - [https://github.com/bearpelican/lyft-perception-challenge](https://github.com/bearpelican/lyft-perception-challenge)

~~~
mkagenius
> they decided not to fix it and asked me to take my submission down.

Imagine if it was a bug in their app leaking millions of users' data and they
went: "We're not going to fix it. You can only use Uber now." "Also, come in
for an interview if you want to fix our web tools instead."

------
yodon
The post indicates there is a competition and prizes but I'm not seeing any
discussion of what sort of license the data is being made available under (or
the competition for that matter). Hopefully it's there and I'm just not seeing
it.

~~~
ekc
The GitHub repo they link to says it's under CC BY-NC-SA 4.0.

~~~
bigiain
Hmmm. I wonder how much fun lawyers will have arguing about whether that "NC"
clause means a model trained on this data cannot be used commercially by the
researcher who built it?

~~~
parsimo2010
I would guess (I'm not a lawyer) that a commercial model would be in the clear
as long as the company doesn't release anything including the data itself.
Model weights that are derived from the data are not the data.

I would make an analogy where the training data is like a textbook. If I read
in a textbook about how to design/build a bridge, I don't have to give
royalties to the textbook author when my civil engineering and construction
firm gets paid to build a bridge. The copyright/license of the textbook can't
prevent me from using the knowledge gained from the book to do a commercial
job. In a similar vein, the knowledge gained from a public data set is
probably fair game for whatever you want, just as long as you aren't
repackaging the data itself. A reasonable boundary is probably the point at
which the model weights could be used to reconstruct the original data.

Of course, other people can disagree. I would look forward to an actual legal
opinion to clear this up.

~~~
jahewson
I disagree. The license specifies that "using the material for commercial
purposes" is prohibited. The act of training a commercial model is obviously a
commercial purpose. Whether or not the data is somehow incorporated into the
resulting model is irrelevant.

You're confusing this CC license with open source licenses that do not
restrict _use_ but require derived works to be created/distributed under
certain conditions. This CC license _restricts use_, in that you are not
allowed to use the data for any commercial purpose. This is totally different
from open source, and more like the "academic use only" licenses which used
to be more common.

~~~
yodon
CC is rarely a good license to choose, other than to make yourself feel good.
It does a terrible job of dealing with the matters that actually matter in a
license outside of photo sharing (and even there it's not a good choice, as
many photographers have found, because it explicitly grants rights to others
that the photographer may not be able to grant, leaving the photographer open
to lawsuits, as has happened).

In this case, imagine a student working on a class assignment. They use this
data for purely academic purposes with no commercial intent in mind. After
they train their system, they realize: wow, I could use this trained system
and get rich. There was arguably no commercial use during the training; the
use of the data was purely academic, like a person learning math or French.
What you do after the learning has run is a separate matter, just as using a
CC-licensed textbook to learn math doesn't prevent you from getting a job as
a statistician.

Again, the tl;dr is instead of trying to divine how a court will deal with a
poorly specified problem, it's much better to just not license your stuff
using a CC license. There are almost always much better licenses to choose
from.

~~~
bigiain
I'm curious about what you (or anyone else) would recommend as better license
choices for datasets that might be used in machine learning model training?

(With, I suppose, hints about what restrictions you might be wanting to grant
or prohibit by particular license options?)

~~~
yodon
IANAL but my sense is that it will be a number of years before we really know
what works with regards to this kind of licensing of DNN/ML training data
sets. It's almost certainly going to be decided on the basis of what's called
"case law", which really just means "a bunch of random judges who don't know
anything about ML made decisions on a bunch of random lawsuits that probably
were not good samples to pick and which were probably taken to court by pairs
of parties with wildly different abilities to pay lawyers and now we are stuck
with those decisions as precedent for future cases."

If that sounds like a crappy legal footing for the next 20 years of software
development, yeah, it is. It's also why Mitch Kapor and a few others founded
and funded the EFF to try to encourage better case law decisions around the
early days of electronic privacy law. We definitely need an EFF like effort
around ML/DNN/etc., but I'm not holding my breath.

I wish I had a better answer, I'm mostly hoping someone else here does.

------
vinayms
This will go against the grain here on HN. I don't know how anyone can imagine
self-driving even succeeding in the real world, let alone in the so-called
third world, unless _all_ vehicles are self-driven and operate in a controlled
environment. There are some really important things pending, like accurate NLP
and computer vision, but no, we need something shiny and useless. I think some
smart computer scientists are getting rich by dangling a carrot in front of
some gullible billionaire investors. Good for them. I hope some of the really
useful stuff piggybacks on this rather lofty endeavor.

~~~
buboard
I'm with you. I think self-driving is an aspiration rather than a concrete
goal. It literally means solving the quintessential problem of robotics, which
is a very, very hard problem. We'll probably have human-level NLP before that.

Car driving is in decline:
[http://www.washingtonpost.com/blogs/wonkblog/files/2013/04/m...](http://www.washingtonpost.com/blogs/wonkblog/files/2013/04/miles-driven-CNP16OV-adjusted.gif)

A more concrete goal for transportation would be to reduce driving even
further by adopting remote work. That is reachable within the decade. In the
meantime, car safety features should be ramped up, but autonomous driving so
far doesn't seem very safe.

------
astrostl
> Self-driving is too big — and too important — an endeavor for any one team
> to solve alone. Transportation serves all of us, and we should all be
> invested in the next step of its evolution.

Imagine how much better things could be if everyone working on maps felt the
same.

------
samirsd
spicy

------
jaimex2
It's looking more and more like everyone is just going to have to license
Tesla's FSD when it's finished.

They are the only ones with a broad real-world data source, and they seem to
have wisely taken the right path by not adopting LIDAR and focusing purely on
passive vision.

~~~
KaiserPro
It's not really that wise; it's a large gamble.

Tesla decided not to include lidar because they couldn't find a manufacturer
that would make one cheap enough for them / fell out over terms. It's not a
statement of vision. It's exactly the same kind of decision as when Apple
dropped Flash support for the iPhone: the processor and RAM were too limited
to support it, Adobe refused to make compromises, and it was too late to
change before launch.

Firstly, Tesla is not focusing purely on passive vision; they are using radar
as well. But because radar is nowhere near high-resolution enough, they need
vision to provide categorization.

Now, Musk makes a lot of noise about avoiding lidar; that's mostly because he
knows it's a massive gamble. Yes, he bleats on about its power budget and
cost, but using pure AI costs a whole lot more in R&D, plus a boatload of
latency. Not to mention the massive power budget needed to run the custom
silicon.

_Eventually_ vision + radar will be more than enough to provide life-critical
level 5 autonomy. However, Tesla barely provides more than level 2.

They have a number of problems to overcome: rain/bug occlusion of the vision
sensors, low-light performance, sunrise/sunset, fog, reliable realtime depth
estimation, etc.

I suspect that CCD-based time-of-flight depth sensors will become cheaper,
lower-power, and smaller (they are almost certainly going to end up in mobile
phones soon) before pure-vision realtime life-critical depth estimation is a
thing.

~~~
jaimex2
Anytime you do something new it's a gamble and things will go wrong.

Tesla has never been shy of using expensive components for their products.
Their entire strategy relies on bringing down battery costs.

Musk makes a very good argument for his choice not to use LIDAR: nothing
alive on this planet uses it. Every creature navigates our world on vision;
even dolphins and bats use sonar only as a secondary system. The proof of
concept is everywhere; Tesla is just adopting it.

And yeah, they're purely level 2 at the moment. I don't agree with their
marketing, as it does lead some to use it as something more before it's ready.

~~~
KaiserPro
> Musk makes a very good argument for his choice to not use LIDAR - nothing
> alive on this planet uses it. Every creature navigates our world on vision,
> even dolphins and bats use sonar as a secondary system. The proof of concept
> is everywhere, Tesla is just adopting it.

Nothing alive uses wheels, apart from us. It's a null argument.

I don't mind a gamble; what I mind is patent bullshit. If he'd been honest
and said "Lidar is great, but it's too expensive, and the manufacturers are
not willing to compromise" and left it at that, it'd be ok. But he didn't. He
dressed it up in semi-prophetic _wank_. Now we have legions of "experts"
blindly parroting that ToF sensors are dead and will never be used in the
future.

I'm willing to bet that a lidar-like time-of-flight sensor will be shoved
into a smartphone in the next 4 years. Why? Because it makes SLAM so much
easier/more immersive/better, which makes AR better and more useful.

Once that happens, all bets are off.

