
Looking at the comments, most people don't understand what this is.

In the geospatial industry, there are many organizations that produce free open data.

For example, the NAIP image data comes from the USDA and has been paid for by the US govt so cities and states can use it for agriculture. That is why the images are not just RGB: they also include an infrared band, so they can feed agricultural algorithms like NDVI. For that particular dataset the license is very liberal. If you are curious about it, you can find more info here: https://www.fsa.usda.gov/programs-and-services/aerial-photog...
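For anyone unfamiliar with NDVI: it is computed per pixel from the near-infrared and red bands as (NIR - Red) / (NIR + Red). A minimal Python sketch, assuming you have already read the two bands into nested lists of numbers (the function names and the handling of all-zero pixels are my own illustrative choices, not anything specific to NAIP):

```python
def ndvi(nir, red):
    """Return NDVI for a single pixel: (NIR - Red) / (NIR + Red)."""
    total = nir + red
    if total == 0:
        # Both bands are zero (e.g. nodata); avoid dividing by zero.
        return 0.0
    return (nir - red) / total


def ndvi_grid(nir_band, red_band):
    """Apply ndvi() pixel by pixel to two equally sized 2D bands."""
    return [
        [ndvi(n, r) for n, r in zip(nir_row, red_row)]
        for nir_row, red_row in zip(nir_band, red_band)
    ]
```

Values range from -1 to 1; healthy vegetation reflects strongly in the infrared, so it ends up toward the high end.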

The problem with datasets of this size is that merely collecting and storing them takes serious resources. This AWS link is saying that they have grabbed all these datasets from various govt agencies and non-profits and are hosting them in raw form so you can use them. Because the data comes from so many different institutions, the licenses differ - but practically speaking they are super liberal.

It is not competing with any previous commercial service from any vendor, nor is it meant to be a solution of any kind... Just big public spatial datasets hosted at AWS.

I want to give another example: I'm currently updating the Terrain Tiles dataset [0] and a very significant portion of my time was spent working around terrible government data download websites or incompatible distribution formats.

For example, the UK Government flew some excellent LIDAR data missions generating a very high resolution elevation model for most of the country and then put it behind a terrible website where you have to click 3 or 4 times to get a small piece of the data. After a couple of hours I built a script to download all the pieces and put them back together into usable-sized downloads [1]. Mexico's INEGI has a similar situation, so I had to dig through that to build a scraper [2]. USGS's EarthExplorer uses a terrible shopping cart metaphor for download [3].
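To give a rough idea of the "put them back together" step, here is a minimal Python sketch. It is not the actual script from [1]; it assumes, purely for illustration, that each small piece can be identified by a (column, row) index on a regular grid, and it only shows how those pieces could be grouped into larger bundles. The `group_tiles` name and the `block_size` parameter are hypothetical.

```python
from collections import defaultdict


def group_tiles(tile_ids, block_size=10):
    """Group (col, row) tile indices into block_size x block_size bundles.

    The indexing scheme is an assumption for illustration, not the
    layout used by any particular agency.
    """
    bundles = defaultdict(list)
    for col, row in tile_ids:
        # Integer division maps each small tile to its parent bundle.
        bundles[(col // block_size, row // block_size)].append((col, row))
    return {key: sorted(tiles) for key, tiles in bundles.items()}
```

Each bundle key then stands for one larger download that a later step would assemble from its member tiles.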

All that is to say that the interesting piece with Earth on AWS is that this is public data that smart people are putting in a more easily accessible place for mass consumption and AWS is footing the bill. In return AWS is getting people more interested in AWS products and a set of customers that are more knowledgeable about how to process data "in the cloud".

[0] https://aws.amazon.com/public-datasets/terrain/

[1] https://github.com/iandees/uk-lidar

[2] https://github.com/iandees/mx-lidar

[3] https://earthexplorer.usgs.gov/

Would you mind if I stuck your collections of data in the Internet Archive? I appreciate AWS' efforts in this regard, but trust the Internet Archive for access and persistence more (and the Archive serves every object as a torrent).

It'd be great to have this data in Internet Archive. If you check out our GitHub repo [0] you can see where we coordinate finding data sources. The data I download gets composited into tiles for display on maps, so we're updating ~4 billion objects in S3.

The source data is probably the better thing to include in IA, and the GitHub repo is probably the best place to find how to mirror it. If you've got time to spend on it, you might post an issue in there and I can help point you in the right direction.

[0] https://github.com/tilezen/joerd/issues?q=is%3Aissue+is%3Aop...

Unrelated to AWS, but sites like the ones you describe are what we used as "How Not to Do It" examples while justifying the (very minor) expense of building the Alaska Elevation Archive. http://elevation.alaska.gov/

The National Map, from the USGS, is another example of "ugh". At least it was when it was tile-by-tile download only.

Basically a glorified mirror with some marketing/use cases.

If you had ever compiled some of these datasets, you would realize that it was impossible to get them without extremely complicated scrapers that generated jobs, waited for a tiny portion (like 0.1% of it) to be extracted, and downloaded the result from an FTP location that had non-standardized names... or how you would end up just giving up, calling people on the phone until the person in charge finally returned your call, and buying pre-filled hard drives from them that would eventually get mailed to you.

This is far more than a glorified mirror.

Yes, I realize some of these sources have small budgets and others have million-dollar budgets, yet none of them has solved "cheap storage & distribution" the way Amazon has. It is a case of each to their own; Amazon is able to help in its way.

My comment was more a warning not to expect the data to be any easier to manipulate here than it is at the original source.

Also, call me cynical, but this is about running machine learning on impossibly large datasets, which means huge profits for AWS.

Win for AWS and win for customers, who can easily access that huge amount of data just via S3 buckets.
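As a rough illustration of how simple that access can be: objects in a bucket that allows anonymous reads can be fetched over plain HTTPS using the virtual-hosted-style S3 URL. A minimal Python sketch; the bucket and key in the usage note are made-up placeholders, not the real names of any of these datasets:

```python
from urllib.parse import quote


def public_s3_url(bucket, key):
    """Build a virtual-hosted-style HTTPS URL for an S3 object.

    This only works for reading if the bucket permits anonymous
    access; the bucket and key you pass in are up to you.
    """
    # Percent-encode the key but keep "/" so the path structure survives.
    return "https://{}.s3.amazonaws.com/{}".format(bucket, quote(key, safe="/"))
```

For example, `public_s3_url("example-bucket", "tiles/10/1/2.png")` gives a URL you can hand to curl, a browser, or any HTTP client.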

...or anywhere else you want.
