
Berkeley Deep Drive Dataset - MuhammedAbiola
http://bdd-data.berkeley.edu/
======
quickben
The website doesn't work in Firefox.

"FAQ: The download buttons do not work".

"The website is fully supported by Chrome now"

The current state of the internet has reached a _very_ low point. One would
expect more from Berkeley. _edu_.

EDIT: I find the downvotes preposterous. Are we really supposed to accept that
a proprietary browser is now required just to download a file from an
educational institution?

~~~
ironjunkie
Since when is Chrome a reference?

If you want something not proprietary, why don't you use Firefox?

~~~
MBCook
It’s in DOWNLOAD button. How does it not work in EVERY browser?

A link would. A form submission would. Ultra simple JavaScript would.

It’s not a question of “why doesn’t Chrome work” but more a question of “how
is this even an issue”.

~~~
derefr
It's a really _big_ download (1.8TB). One where both you and they would be
really rather perturbed if the download failed at 90%.

In fact, they'd probably be perturbed by the bandwidth costs even if everyone
who wanted the dataset was only downloading it once.

Maybe it uses WebTorrent?

(Not sure why it couldn't just fall back to giving you a .torrent file in that
case, though.)

~~~
ori_b
Thankfully, Firefox has the ability to resume broken downloads. As does nearly
every other method of downloading, other than (apparently) Chrome.
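For what it's worth, a resume is just an HTTP Range request starting at the
size of the partial file you already have. A minimal Python sketch (the URL is
whatever the dataset page hands you, and this assumes the server honors Range
headers):

```python
import os
import urllib.request

def range_header(dest):
    """Ask for bytes starting at the size of whatever partial file we have."""
    offset = os.path.getsize(dest) if os.path.exists(dest) else 0
    return {"Range": "bytes=%d-" % offset} if offset else {}

def resume_download(url, dest):
    """Append the remainder of `url` onto the partial file at `dest`."""
    req = urllib.request.Request(url, headers=range_header(dest))
    with urllib.request.urlopen(req) as resp, open(dest, "ab") as out:
        while True:
            chunk = resp.read(1 << 20)  # 1 MiB at a time
            if not chunk:
                break
            out.write(chunk)
```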

~~~
derefr
Broken downloads, yes. _Corrupted_ downloads, no. Given that files served from
CDNs are still usually served without HTTPS, there aren't many checksums
between the two ends of the pipe to protect it from on-the-wire corruption.
Doesn't matter much for video streaming ala Netflix; matters a lot for a
structured dataset.

BitTorrent and related protocols handle this automatically by breaking the
file into large (megabyte-range) chunks, and then putting the cryptographic
hashes of all the chunks in the manifest. As long as you've received the
manifest, you can protect against both passive corruption and active MITMing
in the same way you resume broken downloads: by just discarding chunks that
failed to complete to a state of "has all the bytes and hashes correctly", and
trying those chunks again.

(Sadly, HTTP doesn't support a digest response header that applies to each
chunk of a "Transfer-Encoding: chunked" response stream, or it could vaguely
compete with this. The Content-MD5 header could have done this, but it was
removed precisely because implementations were in conflict on whether it was
for this, or for hashing the document as a whole.)
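The scheme is simple enough to sketch. Here's a toy version in Python; the
chunk size and hash choice are illustrative (real BitTorrent uses SHA-1
pieces listed in the .torrent manifest):

```python
import hashlib

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MiB pieces; real protocols pick similar sizes

def make_manifest(data):
    """Hash each fixed-size chunk; the list of digests is the manifest."""
    return [hashlib.sha256(data[i:i + CHUNK_SIZE]).hexdigest()
            for i in range(0, len(data), CHUNK_SIZE)]

def bad_chunks(data, manifest):
    """Indices of chunks that failed to arrive intact - the only pieces
    that need to be re-fetched after corruption or a broken transfer."""
    return [i for i, want in enumerate(manifest)
            if hashlib.sha256(data[i * CHUNK_SIZE:(i + 1) * CHUNK_SIZE]).hexdigest() != want]
```

A single flipped bit anywhere in the stream only invalidates (and re-fetches)
the one chunk containing it.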

------
adpirz
In case you don't want to register and are curious about some metadata:

Videos: 100K video clips. Size: 1.8TB

Info: The GPS/IMU information recorded along with the videos. Size: 3.9GB

Images: Two subfolders: 1) 100K labeled key frame images extracted from the
videos at the 10th second, and 2) 10K key frames for full-frame semantic
segmentation. Size: 6.5GB

Labels: Annotations of road objects, lanes, and drivable areas in JSON format.
Details at the GitHub repo. Size: 147MB

Drivable Maps: Segmentation maps of drivable areas. Size: 661MB

Segmentation: Full-frame semantic segmentation maps. The corresponding images
are in the same folder. Size: 1.2GB

~~~
sytelus
While this is the largest public annotated driving dataset, it still only
covers 2 or 3 cities, if I'm reading this right.

~~~
Isamu
4 regions - see page 7 of the paper. New York, Berkeley, San Francisco, and
"Bay Area"

[https://arxiv.org/pdf/1805.04687.pdf](https://arxiv.org/pdf/1805.04687.pdf)

------
trillic
License:

"""

Copyright ©2018. The Regents of the University of California (Regents). All
Rights Reserved.

Permission to use, copy, modify, and distribute this software and its
documentation for educational, research, and not-for-profit purposes, without
fee and without a signed licensing agreement; and permission use, copy, modify
and distribute this software for commercial purposes (such rights not subject
to transfer) to BDD member and its affiliates, is hereby granted, provided
that the above copyright notice, this paragraph and the following two
paragraphs appear in all copies, modifications, and distributions. Contact The
Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510,
Berkeley, CA 94720-1620, (510) 643-7201, otl@berkeley.edu,
[http://ipira.berkeley.edu/industry-info](http://ipira.berkeley.edu/industry-
info) for commercial licensing opportunities.

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT,
SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING
OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS
BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED
HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE
MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

"""

~~~
rectang
If I'm reading that right, it's not an open source license. It has a field-of-
use restriction, because only Berkeley Deep Drive members can use it for
commercial purposes.

EDIT: The title of this HN topic is wrong. It's not what's in the source and
it needs to be changed. (I'm relieved that it's just a submitter summarizing
incorrectly and that Berkeley Deep Drive was not responsible for this mistake.)

~~~
mehrdadn
> If I'm reading that right, it's not an open source license. It has a field-
> of-use restriction

How do _usage_ terms dictate whether the _source_ is open?

~~~
rectang
"Open" is not just whether you get to see it. Among other things, it's whether
the software is open for all users to _modify_.

The Open Source Definition is curated by the Open Source Initiative and has
been stable for many years. A huge industry rests on top of it.

[https://opensource.org/osd](https://opensource.org/osd)

~~~
mehrdadn
So you're saying neither (say) VirtualBox nor other GNU software qualify as
open-source? They seem to fail the very first sentence of criterion #1, since
they place restrictions on when/how you can redistribute the source/software
in aggregation with other sources/software.

~~~
ghaff
The GPL absolutely falls under the open source definition. You can ship GPL
programs and code alongside programs that may have different licenses. What
you can't do is _combine_ GPL code with code that has incompatible licenses.

Copyleft imposes some requirements on redistribution. It does not impose
restrictions on usage at all.

~~~
mehrdadn
> Copyleft imposes some requirements on redistribution. It does not impose
> restrictions on usage at all.

I wasn't saying copyleft imposes restrictions on usage.

The first "open-source" criterion says the following (and note that, like you
said, this is a restriction on _redistribution_ and _not_ usage):

> The license shall not restrict any party from selling or giving away the
> software as a component of an aggregate software distribution containing
> programs from several different sources.

We both agree GPL places a restriction on redistribution (namely: that it must
be with source code). However, criterion #1 says very clearly that the license
_can't_ place restrictions on software redistribution when it's aggregated
with software from different sources.

This is a pretty clear contradiction to me. The fact that you cannot
redistribute GPL software without source (whether bundled with other software
or otherwise) is a restriction on whether/how you can redistribute GPL
software, hence it goes against the "shall not restrict" requirement. And
there's no exception carved out for "restrictions that require source code to
be included". So I don't see how we get to ignore this and cherry-pick what
restrictions actually fall under "restrictions"...

~~~
codetrotter
> namely: that it must be with source code

GPL does not say that. What it does say is that you must provide the source
upon request.

> The fact that you cannot redistribute GPL software without source (whether
> bundled with other software or otherwise) is a restriction on whether/how
> you can redistribute GPL software, hence it goes against the "shall not
> restrict" requirement. And there's no exception carved out for "restrictions
> that require source code to be included".

So all of what you said there is simply incorrect, because like I said, you
absolutely _can_ distribute GPL software without including the source code
alongside it. And that is what is done by everyone 99% of the time.

You only need to provide the source upon request to the people that ask you
for it.

I encourage you to take the time to read the GPL FAQ. Even though GPL is not
my preferred license I think it is important to have a good understanding of
it. [https://www.gnu.org/licenses/gpl-
faq.en.html](https://www.gnu.org/licenses/gpl-faq.en.html)

~~~
mehrdadn
>> namely: that it must be with source code

> GPL does not say that. What it does say is that you must provide the source
> upon request.

Yes, I was being brief. I'm well aware. [1]

[1]
[https://news.ycombinator.com/item?id=17202439](https://news.ycombinator.com/item?id=17202439)

\-------------

To your edit:

> So all of what you said there is simply incorrect, because like I said, you
> absolutely can distribute GPL software without including the source code
> alongside it. And that is what is done by everyone 99% of the time. You only
> need to provide the source upon request to the people that ask you for it.

No, it makes no difference at all. You cannot redistribute the software unless
you are willing and able to redistribute the source code as well. That is very
clearly a restriction on your redistribution of the software. The fact that we
happen to be talking about _the software's own source code_ makes no
difference as to whether it's a restriction or not. It'd be a restriction
whether we're talking about "source code", or "$100,000", or anything else.
The simple fact that you have to be willing and able to provide {something}
before you can redistribute the software is obviously a restriction on your
redistribution of the software.

------
rck
The license seems a bit odd. It refers to software, but I can't see any
software in any of the downloads - just data. Clearly the license is intended
to cover the contents of the downloads, but the wording seems wrong then.

I'm not a lawyer, so maybe someone with more expertise could chime in...

------
itchyjunk
I am curious: does training the AI on other driving datasets help? What I
mean is, not just sedan datasets - trucks, buses, maybe two-wheelers too.
Would this help the model generalize more and make better predictions about
how other vehicles behave, or would it just add noise?

~~~
cr4zy
It helps a lot in my experience. In simulation I tried this with imitation
learning, training on a hood camera, a camera at the height of a semi-truck
hood, and another camera offset 1.5m to the left, with steering and throttle
as labels. I also added random noise to the position (less than a meter),
rotation (less than a degree), fov (less than a degree), capture height (<
1%), and capture width (< 1%). The result was a 3x higher average score on a
driving benchmark where the score was meters driven minus seconds taken,
second-meters of lane deviation, and seconds where acceleration surpassed
0.5g (to measure comfort). The dataset, training code, and sim are at
deepdrive.io - a different entity with the same name :)
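(For the curious, the camera jitter described above might look roughly like
this. The config structure here is hypothetical, but the noise ranges are the
ones from the comment:)

```python
import random

def jitter_camera(cam):
    """Randomly perturb a camera config for training-time augmentation.

    `cam` holds position (meters), rotation/fov (degrees), and capture
    size (pixels); each field gets independent noise per sample."""
    return {
        "position": [p + random.uniform(-1.0, 1.0) for p in cam["position"]],  # < 1 m
        "rotation": [r + random.uniform(-1.0, 1.0) for r in cam["rotation"]],  # < 1 deg
        "fov": cam["fov"] + random.uniform(-1.0, 1.0),                          # < 1 deg
        "height": round(cam["height"] * random.uniform(0.99, 1.01)),            # < 1%
        "width": round(cam["width"] * random.uniform(0.99, 1.01)),              # < 1%
    }
```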

------
bytematic
This made me wonder: how will driverless cars interpret roads that have very
faded, non-existent, or unusual lane markings? I imagine rural roads, parks,
construction zones, and weather like snow or dust can really obscure things.

------
arielbaz
First, thanks for sharing this data. Second: why on earth would anyone create
a 1.8TB zip file of 100k videos? The video encoder has likely already squeezed
every possible bit out of these videos; zip is _not going to make them
smaller_. It is, however, going to force everyone to download the full 1.8TB
file even to get a single video out of the archive. Makes me wonder what else
is happening here (like the Chrome-only download link hosted on another
domain, the non-HTTPS login, and that escalator to nowhere...)

------
faitswulff
Tangent - does anyone know whether Waymo uses user-generated driving data from
Navigation mode in Google Maps for their self-driving work? Or whether that
would even be feasible or useful?

~~~
henryfjordan
GPS + IMU data from phones might be useful but neither is particularly
accurate compared to video data. Maybe the IMU data would be useful to make
the car feel more "human"

------
syntaxing
I'm confused... are the datasets free to use (for both for-profit and
nonprofit purposes) while the pretrained models (mentioned in the paper) are
under the UC license?

------
Animats
This is nice, but it assumes that grinding on lots of successful driving video
is a valid path to automatic driving.

~~~
taneq
Presumably lots of successful driving video is at least a prerequisite to
_validating_ an automatic driving system, even if you're not using it for
training?

------
sanxiyn
Is there a similar dataset for lidar? My understanding was that lidar is more
important than video. Has it changed?

------
Uberphallus
Does it include accidents, and incidents with corrective maneuvers? That's the
training data that's hard to come by.

------
syntaxing
Before I register to download the data: is there a smaller dataset to play
with on the portal? I've been itching to do something fun after taking the SDC
course from Udacity, but 1.8TB is way more than I can handle right now. Can
someone upload a portion of this (<10GB)?

------
jamesblonde
I am downloading this dataset. It will take about 40 days, is my guess....

~~~
nerdponx
How would one go about creating a torrent for it? Or uploading to IPFS?

~~~
web007
I already emailed the creator a couple weeks back to request / offer a
torrent, but haven't heard anything back.

The problem here is that both of your suggestions involve a 2-step process:

1. Download the file

2. Create a torrent from it, or upload it to IPFS

Since step 1 is already a 2TB download, getting to either version of step 2 is
untenable. I agree with one of the other posters in this thread, the default
for something like this should be torrent since you get both distribution and
checksumming for free.

It would also be nice if it wasn't a 2TB zip file, which then has to be
unzipped onto another 2TB of storage for practical use.
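(At least the second 2TB is avoidable: the zip central directory lets you seek
to a single member and stream it out without unpacking the rest. A sketch,
with a made-up member name:)

```python
import zipfile

def extract_one(archive_path, member, dest):
    """Stream one member out of a large zip without extracting the rest.

    zipfile reads the central directory and seeks straight to the member,
    so only that one file's bytes are decompressed."""
    with zipfile.ZipFile(archive_path) as zf:
        with zf.open(member) as src, open(dest, "wb") as out:
            while True:
                chunk = src.read(1 << 20)  # 1 MiB at a time
                if not chunk:
                    break
                out.write(chunk)
```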

~~~
jamesblonde
Subject to licensing, we intend to make the dataset available (along with
loads of other big datasets for ML) using a BitTorrent-like program called
Dela on the Hops Hadoop platform. It should all be released in about 3 weeks,
this dataset included. Dela integrates with HDFS/S3/GCS backends, supports NAT
traversal, and uses delay-based congestion control over UDP - good for high-
bandwidth, high-latency networks. See
[http://www.hops.io](http://www.hops.io) and our paper -
[https://ieeexplore.ieee.org/document/7980225/](https://ieeexplore.ieee.org/document/7980225/)

------
supergirl
Good luck downloading it at 10KB/s.

