Berkeley Deep Drive Dataset (berkeley.edu)
300 points by MuhammedAbiola 8 months ago | 52 comments



The website doesn't work in Firefox.

"FAQ: The download buttons do not work".

"The website is fully supported by Chrome now"

The current state of the internet has reached a very low point. One would expect more from berkeley.edu.

EDIT: I find the downvotes preposterous. Are we somehow supposed to expect requiring a proprietary browser to simply download a file from an educational institution now?


This appears to be a hosted site the team put together as a public front for the dataset, not an official Berkeley page. It certainly doesn't meet any standards required of a public institution, like ADA compliance, either.


Works great on FF60.0.1


Since when is Chrome a reference?

If you want something not proprietary, why don't you use Firefox?


It’s a DOWNLOAD button. How does it not work in EVERY browser?

A link would. A form submission would. Ultra simple JavaScript would.

It’s not a question of “why doesn’t Chrome work” but more a question of “how is this even an issue”.


It's a really big download (1.8TB). One where both you and they would be really rather perturbed if the download failed at 90%.

In fact, they'd probably be perturbed by the bandwidth costs even if everyone who wanted the dataset was only downloading it once.

Maybe it uses WebTorrent?

(Not sure why it couldn't just fall back to giving you a .torrent file in that case, though.)


Thankfully, Firefox has the ability to resume broken downloads. As does nearly every other method of downloading, other than (apparently) Chrome.


Broken downloads, yes. Corrupted downloads, no. Given that files served from CDNs are still usually served without HTTPS, there aren't many checksums between the two ends of the pipe to protect against on-the-wire corruption. That doesn't matter much for video streaming à la Netflix; it matters a lot for a structured dataset.

BitTorrent and related protocols handle this automatically by breaking the file into large (megabyte-range) chunks, and then putting the cryptographic hashes of all the chunks in the manifest. As long as you've received the manifest, you can protect against both passive corruption and active MITMing in the same way you resume broken downloads: by just discarding chunks that failed to complete to a state of "has all the bytes and hashes correctly", and trying those chunks again.

(Sadly, HTTP doesn't support a digest response header that applies to each chunk of a "Transfer-Encoding: chunked" response stream, or it could vaguely compete with this. The Content-MD5 header could have done this, but it was removed precisely because implementations were in conflict on whether it was for this, or for hashing the document as a whole.)
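To make that concrete, here's a rough sketch (not anything the site or a real torrent client actually ships) of how per-piece hashes in a manifest let you re-fetch only the pieces that failed:

    import hashlib

    PIECE_SIZE = 4 * 1024 * 1024  # 4 MiB pieces, roughly what a torrent client uses

    def piece_hashes(path):
        # Per-piece hashes, i.e. what would live in the manifest.
        hashes = []
        with open(path, "rb") as f:
            while True:
                piece = f.read(PIECE_SIZE)
                if not piece:
                    break
                hashes.append(hashlib.sha1(piece).hexdigest())
        return hashes

    def bad_pieces(path, manifest):
        # Indices of pieces that don't match the manifest - the only
        # byte ranges that would need to be fetched again.
        return [i for i, (have, want) in enumerate(zip(piece_hashes(path), manifest))
                if have != want]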


Why even fall back? The default for something like this should be a torrent as this is exactly what BitTorrent was supposed to solve!


"only hackers use torrent"


>It's a really big download (1.8TB).

COCO (and friends) provide either cloud-backed rsync tools or curl snippets for this reason.
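Those snippets basically amount to a resumable ranged download; a rough Python equivalent (the URL below is a placeholder, not the actual endpoint):

    import os
    import requests  # third-party: pip install requests

    def resume_download(url, dest, chunk_size=1 << 20):
        # Pick up where a previous attempt left off using an HTTP Range request.
        have = os.path.getsize(dest) if os.path.exists(dest) else 0
        headers = {"Range": "bytes=%d-" % have} if have else {}
        with requests.get(url, headers=headers, stream=True, timeout=60) as r:
            r.raise_for_status()
            mode = "ab" if r.status_code == 206 else "wb"  # 206 = server honoured the range
            with open(dest, mode) as f:
                for chunk in r.iter_content(chunk_size):
                    f.write(chunk)

    # e.g. resume_download("https://example.org/bdd100k_videos.zip", "bdd100k_videos.zip")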


In this case I agree with you. I was just pointing out that we should be careful with Chrome being used as a "reference".


That part I totally agree with.

I’m really tired of people declaring other browsers “broken” because they don’t implement the future-of-the-minute that Chrome has already added.


FWIW Chrome does have the largest market share of any desktop internet browser. That said, I understand what you are getting at and agree.


That's what I accessed it with actually. It didn't work in Firefox. I clarified my post above.


In case you don't want to register and are curious about some metadata:

Videos: 100K video clips. Size: 1.8TB

Info: GPS/IMU information recorded along with the videos. Size: 3.9GB

Images: Two subfolders: 1) 100K labeled key frame images extracted from the videos at the 10th second; 2) 10K key frames for full-frame semantic segmentation. Size: 6.5GB

Labels: Annotations of road objects, lanes, and drivable areas in JSON format. Details in the GitHub repo. Size: 147MB

Drivable Maps: Segmentation maps of drivable areas. Size: 661MB

Segmentation: Full-frame semantic segmentation maps. The corresponding images are in the same folder. Size: 1.2GB


While this is the largest public annotated dataset, it's still only from 2 or 3 cities if I'm reading this right.


4 regions (see page 7 of the paper): New York, Berkeley, San Francisco, and "Bay Area".

https://arxiv.org/pdf/1805.04687.pdf


First, thanks for sharing this data. Second - why on earth would anyone create a 1.8TB zip file of 100k videos? The video encoder has likely already compressed every possible bit out of these videos; zip is not going to make them smaller. It does, however, make it mandatory for everyone to download the full 1.8TB file even to get a single video out of the archive. Makes me wonder what else is happening here (like the Chrome-only download link hosted on another domain, the non-HTTPS login, and that escalator to nowhere...)


License:

"""

Copyright ©2018. The Regents of the University of California (Regents). All Rights Reserved.

Permission to use, copy, modify, and distribute this software and its documentation for educational, research, and not-for-profit purposes, without fee and without a signed licensing agreement; and permission use, copy, modify and distribute this software for commercial purposes (such rights not subject to transfer) to BDD member and its affiliates, is hereby granted, provided that the above copyright notice, this paragraph and the following two paragraphs appear in all copies, modifications, and distributions. Contact The Office of Technology Licensing, UC Berkeley, 2150 Shattuck Avenue, Suite 510, Berkeley, CA 94720-1620, (510) 643-7201, otl@berkeley.edu, http://ipira.berkeley.edu/industry-info for commercial licensing opportunities.

IN NO EVENT SHALL REGENTS BE LIABLE TO ANY PARTY FOR DIRECT, INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES, INCLUDING LOST PROFITS, ARISING OUT OF THE USE OF THIS SOFTWARE AND ITS DOCUMENTATION, EVEN IF REGENTS HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

REGENTS SPECIFICALLY DISCLAIMS ANY WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE SOFTWARE AND ACCOMPANYING DOCUMENTATION, IF ANY, PROVIDED HEREUNDER IS PROVIDED "AS IS". REGENTS HAS NO OBLIGATION TO PROVIDE MAINTENANCE, SUPPORT, UPDATES, ENHANCEMENTS, OR MODIFICATIONS.

"""


If I'm reading that right, it's not an open source license. It has a field-of-use restriction because only Berkeley Deep Drive members can use it for commercial purposes.

EDIT: The title of this HN topic is wrong. It's not what's in the source and it needs to be changed. (I'm relieved that it's just a submitter summarizing incorrectly and that Berkeley Deep Drive was not responsible for this mistake.)


Thanks, we've reverted the title from the submitted “UC Berkeley Open Sources Largest Self-Driving Dataset” to the original.


> If I'm reading that right, it's not an open source license. It has a field-of-use restriction

How do usage terms dictate whether the source is open?


"Open" is not just whether you get to see it. Among other things, it's whether the software is open for all users to modify.

The Open Source Definition is curated by the Open Source Initiative and has been stable for many years. A huge industry rests on top of it.

https://opensource.org/osd


So you're saying neither (say) VirtualBox nor other GNU software qualify as open-source? They seem to fail the very first sentence of criterion #1, since they place restrictions on when/how you can redistribute the source/software in aggregation with other sources/software.


The GPL absolutely falls under the open source definition. You can ship GPL programs and code alongside programs that may have different licenses. What you can't do is combine GPL code with code that has incompatible licenses.

Copyleft imposes some requirements on redistribution. It does not impose restrictions on usage at all.


> Copyleft imposes some requirements on redistribution. It does not impose restrictions on usage at all.

I wasn't saying copyleft imposes restrictions on usage.

The first "open-source" criterion says the following (and note that, like you said, this is a restriction on redistribution and not usage):

> The license shall not restrict any party from selling or giving away the software as a component of an aggregate software distribution containing programs from several different sources.

We both agree GPL places a restriction on redistribution (namely: that it must be with source code). However, criterion #1 says very clearly that the license can't place restrictions on software redistribution when it's aggregated with software from different sources.

This is a pretty clear contradiction to me. The fact that you cannot redistribute GPL software without source (whether bundled with other software or otherwise) is a restriction on whether/how you can redistribute GPL software, hence it goes against the "shall not restrict" requirement. And there's no exception carved out for "restrictions that require source code to be included". So I don't see how we get to ignore this and cherry-pick what restrictions actually fall under "restrictions"...


> namely: that it must be with source code

GPL does not say that. What it does say is that you must provide the source upon request.

> The fact that you cannot redistribute GPL software without source (whether bundled with other software or otherwise) is a restriction on whether/how you can redistribute GPL software, hence it goes against the "shall not restrict" requirement. And there's no exception carved out for "restrictions that require source code to be included".

So all of what you said there is simply incorrect, because like I said, you absolutely can distribute GPL software without including the source code alongside it. And that is what is done by everyone 99% of the time.

You only need to provide the source upon request to the people that ask you for it.

I encourage you to take the time to read the GPL FAQ. Even though GPL is not my preferred license I think it is important to have a good understanding of it. https://www.gnu.org/licenses/gpl-faq.en.html


>> namely: that it must be with source code

> GPL does not say that. What it does say is that you must provide the source upon request.

Yes, I was being brief. I'm well aware. [1]

[1] https://news.ycombinator.com/item?id=17202439

-------------

To your edit:

> So all of what you said there is simply incorrect, because like I said, you absolutely can distribute GPL software without including the source code alongside it. And that is what is done by everyone 99% of the time. You only need to provide the source upon request to the people that ask you for it.

No, it makes no difference at all. You cannot redistribute the software unless you are willing and able to redistribute the source code as well. That is very clearly a restriction on your redistribution of the software. The fact that we happen to be talking about the software's own source code makes no difference as to whether it's a restriction or not. It'd be a restriction whether we're talking about "source code", or "$100,000", or anything else. The simple fact that you have to be willing and able to provide {something} before you can redistribute the software is obviously a restriction on your redistribution of the software.


“Open source” means a lot more than “you may look at the source.”

https://opensource.org/osd


"Why Open Source misses the point of Free Software" by Richard Stallman

> When we call software “free,” we mean that it respects the users' essential freedoms: the freedom to run it, to study and change it, and to redistribute copies with or without changes. This is a matter of freedom, not price, so think of “free speech,” not “free beer.”

https://www.gnu.org/philosophy/open-source-misses-the-point....


There's a lot of interesting history between the Open Source and Free Software movements, and this is a valuable view of one side of the story. I'd recommend it for people who are interested, though I hope they would also seek out the other side.

However, we're going off topic here. The self-driving download is neither Open Source nor Free Software.


Truly gracious reply. Well met. :-)


BDD = Berkeley Deep Drive (https://deepdrive.berkeley.edu/)


The license seems a bit odd. It refers to software, but I can't see any software in any of the downloads - just data. Clearly the license is intended to cover the contents of the downloads, but the wording seems wrong then.

I'm not a lawyer, so maybe someone with more expertise could chime in...


I am curious, does training the AI on other driving data sets help? What I mean is, not just sedan data sets: trucks, buses, maybe two-wheelers? Would this help the model generalize more and make better predictions of how other vehicles behave, or would it just add noise?


It helps a lot in my experience. In simulation I tried this with imitation learning by training on a hood camera, a camera at the height of a semi-truck hood, and another camera offset 1.5m to the left, with steering and throttle as labels. I also added random noise to the position (less than a meter), rotation (less than a degree), FOV (less than a degree), capture height (< 1%), and capture width (< 1%). The result was a 3x higher average score on a driving benchmark, where the score was meters driven minus seconds taken, second-meters of lane deviation, and seconds where acceleration surpassed 0.5g (to measure comfort). The dataset, training code, and sim are at deepdrive.io - a different entity with the same name :)
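Roughly what that jitter looked like, as a sketch - the camera dict and its keys here are made up for illustration, not the actual deepdrive.io API:

    import random

    def jitter_camera(cam):
        # Perturb a camera spec by roughly the ranges described above.
        # Keys here are hypothetical, not the real simulator's parameters.
        return {
            "x": cam["x"] + random.uniform(-1.0, 1.0),      # position: < 1 m per axis
            "y": cam["y"] + random.uniform(-1.0, 1.0),
            "z": cam["z"] + random.uniform(-1.0, 1.0),
            "yaw": cam["yaw"] + random.uniform(-1.0, 1.0),  # rotation: < 1 degree
            "fov": cam["fov"] + random.uniform(-1.0, 1.0),  # field of view: < 1 degree
            "height": int(cam["height"] * random.uniform(0.99, 1.01)),  # capture size: < 1%
            "width": int(cam["width"] * random.uniform(0.99, 1.01)),
        }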


This made me wonder, how will driverless cars interpret roads that have very faded, non-existent, or unique lane markings? I imagine rural roads, parks, construction zones, and weather like snow or dust can really obscure things.


Tangent - does anyone know if Waymo uses user-generated driving data from Navigation mode in Google Maps for their self-driving work? Or whether that would even be feasible or useful?


GPS + IMU data from phones might be useful, but neither is particularly accurate compared to video data. Maybe the IMU data would be useful to make the car feel more "human".


I'm confused...are the datasets free (as in for profit and nonprofit) to use and the pretrained models (mentioned in the paper) are under the UC license?


Does it include accidents and incidents with the correct maneuvers? That's the training data that's hard to come by.


This is nice, but it assumes that grinding on lots of successful driving video is a valid path to automatic driving.


Presumably lots of successful driving video is at least a prerequisite to validating an automatic driving system, even if you're not using it for training?


Is there a similar dataset for lidar? My understanding was that lidar is more important than video. Has it changed?


Before I register to download the data, is there a smaller dataset to play with on the portal? I've been itching to do something fun after taking the SDC course from Udacity, but 1.8TB is way more than I can handle right now. Can someone upload a portion of this (<10GB)?


I am downloading this dataset. It will take about 40 days, is my guess....


How would one go about creating a torrent for it? Or uploading to IPFS?


I already emailed the creator a couple weeks back to request / offer a torrent, but haven't heard anything back.

The problem here is that both of your suggestions involve a 2-step process:

1. Download the file

2. Create a torrent from it, or upload it to IPFS

Since step 1 is already a 2TB download, getting to either version of step 2 is untenable. I agree with one of the other posters in this thread, the default for something like this should be torrent since you get both distribution and checksumming for free.

It would also be nice if it wasn't a 2TB zip file, which then has to be unzipped onto another 2TB of storage for practical use.
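(For what it's worth, Python's zipfile can stream individual members out of the archive without a second full extraction - you still need the whole 2TB zip on disk, though. The file names below are guesses, not the real layout:)

    import zipfile

    # Assumes the full archive is already downloaded; this only avoids
    # unzipping everything onto another 2TB of storage.
    with zipfile.ZipFile("bdd100k_videos.zip") as archive:
        clips = [n for n in archive.namelist() if not n.endswith("/")]  # skip folder entries
        with archive.open(clips[0]) as member, open("first_clip.mov", "wb") as out:
            for chunk in iter(lambda: member.read(1 << 20), b""):
                out.write(chunk)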


Subject to licensing, we intend to make the dataset available (along with loads of other big datasets for ML) using a BitTorrent-like program called Dela for the Hops Hadoop platform. Maybe in 3 weeks or so it will all be released - with this dataset. Dela integrates with HDFS/S3/GCS backends, supports NAT traversal, and uses delay-based congestion control over UDP - good for high-bandwidth/high-latency networks. See http://www.hops.io and our paper - https://ieeexplore.ieee.org/document/7980225/


Good luck downloading them at 10KB/s.


[flagged]


Next up, public bus footage, if all the ducks become aligned.



