Earth on AWS – Open geospatial data (amazon.com)
663 points by thecodeboy 79 days ago | 89 comments

Looking at the comments, most people don't understand what this is.

In the geospatial industry, there are many organizations that produce free open data.

For example, the NAIP image data comes from the USDA and has been paid for by the US govt so that cities and states can use it for agriculture. That is why the images are not just RGB: they also include an infrared band so they can be used for agricultural algorithms like NDVI. For that particular dataset the license is very liberal. In case you are curious about that particular dataset, you can find more info here: https://www.fsa.usda.gov/programs-and-services/aerial-photog...
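For readers who have not met NDVI, here is a minimal sketch of the per-pixel calculation using the standard definition (NIR - Red) / (NIR + Red). The reflectance values are made up for illustration and are not taken from NAIP.

```python
def ndvi(nir, red):
    """Normalized Difference Vegetation Index for a single pixel."""
    total = nir + red
    if total == 0:
        # Convention chosen for this sketch to avoid dividing by zero.
        return 0.0
    return (nir - red) / total


# Hypothetical reflectance values, for illustration only.
vegetation = ndvi(nir=0.50, red=0.08)  # healthy plants reflect strongly in NIR
bare_soil = ndvi(nir=0.30, red=0.25)   # soil reflects NIR and red about equally
```

Values close to 1 suggest dense green vegetation, while values near 0 suggest bare ground.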

The problem with datasets of this size is that merely collecting and storing them takes serious resources. This AWS link is saying that they have grabbed all these datasets from various govt agencies and non-profits and are hosting them in raw form so you can use them. Because the data comes from so many different institutions, the licenses differ, but practically speaking they are super liberal.

It is not competing with any previous commercial service from any vendor, nor is it meant to be a solution of any kind... Just big public spatial datasets hosted at AWS.

I want to give another example: I'm currently updating the Terrain Tiles dataset [0] and a very significant portion of my time was spent working around terrible government data download websites or incompatible distribution formats.

For example, the UK Government flew some excellent LIDAR data missions generating a very high resolution elevation model for most of the country and then put it behind a terrible website where you have to click 3 or 4 times to get a small piece of the data. After a couple hours I built a script to download all the pieces and put them back together into usable sized downloads [1]. Mexico's INEGI has a similar situation, so I had to dig through that to build a scraper [2]. USGS's EarthExplorer uses a terrible shopping cart metaphor for download [3].

All that is to say that the interesting piece with Earth on AWS is that this is public data that smart people are putting in a more easily accessible place for mass consumption and AWS is footing the bill. In return AWS is getting people more interested in AWS products and a set of customers that are more knowledgeable about how to process data "in the cloud".

[0] https://aws.amazon.com/public-datasets/terrain/

[1] https://github.com/iandees/uk-lidar

[2] https://github.com/iandees/mx-lidar

[3] https://earthexplorer.usgs.gov/

Would you mind if I stuck your collections of data in the Internet Archive? I appreciate AWS' efforts in this regard, but trust the Internet Archive for access and persistence more (and the Archive serves every object as a torrent).

It'd be great to have this data in Internet Archive. If you check out our GitHub repo [0] you can see where we coordinate finding data sources. The data I download gets composited into tiles for display on maps, so we're updating ~4 billion objects in S3.

The source data is probably the better thing to include in IA, and the GitHub repo is probably the best place to find how to mirror it. If you've got time to spend on it, you might post an issue in there and I can help point you in the right direction.

[0] https://github.com/tilezen/joerd/issues?q=is%3Aissue+is%3Aop...
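To make the "composited into tiles" step a bit more concrete, here is a rough sketch of how a latitude/longitude maps to a tile address in the usual web-map (slippy map) scheme, and how an elevation is decoded from one pixel of a Terrarium-encoded PNG tile. The URL template is an assumption based on my recollection of the Terrain Tiles documentation, so check it against the dataset page before relying on it.

```python
import math

# Assumed URL layout for the Terrarium-format tiles; verify against the dataset docs.
TILE_URL = "https://s3.amazonaws.com/elevation-tiles-prod/terrarium/{z}/{x}/{y}.png"


def lonlat_to_tile(lon, lat, zoom):
    """Return the (x, y) slippy-map tile containing a WGS84 point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    # Clamp so that points on the eastern/southern edge stay inside the grid.
    return min(max(x, 0), n - 1), min(max(y, 0), n - 1)


def terrarium_to_meters(r, g, b):
    """Decode one Terrarium RGB pixel into an elevation in meters."""
    return (r * 256 + g + b / 256) - 32768


def tile_url(lon, lat, zoom):
    """Build the (assumed) URL of the tile covering a point."""
    x, y = lonlat_to_tile(lon, lat, zoom)
    return TILE_URL.format(z=zoom, x=x, y=y)
```

Fetching and decoding the PNG itself is left out on purpose, since that needs an HTTP client and an image library.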

Unrelated to AWS, but sites like you describe are what we used in "How not to do It" while justifying the (very minor) expense in building the Alaska Elevation Archive. http://elevation.alaska.gov/

The National Map, from the USGS, is another example of "ugh", at least it was when it was tile-by-tile download only.

Basically a glorified mirror with some marketing/use cases.

If you had ever compiled some of these datasets, you would realize that it was impossible to get them without extremely complicated scrapers that generated jobs, waited for a tiny portion to be extracted (like 0.1% of it), and downloaded the result from an FTP location with non-standardized names... Or you would end up just giving up, calling people on the phone until the person in charge finally returned your call, and buying pre-filled HDs from them that would eventually get mailed to you.

This is far more than a glorified mirror.

I realize that some of these sources have small budgets and others have million-dollar budgets, yet they haven't solved "cheap storage & distribution" the way Amazon has. It is a case of each to their own; Amazon is able to help in its way.

My comment was more a warning to not expect this to be any easier to access than the original data in the way of manipulating data.

Also, call me cynical but this is about running machine learning on impossibly large datasets meaning huge profits for AWS.

Win for AWS and win for customers who can easily access that huge amount of data just via s3 buckets.

...or anywhere else you want.

Looks like a lot of this data comes from free sources. It's not clear from their site what the licensing is though.

How does this compare to other offerings like Google Earth Engine[1], GCP Landsat[2], or GCP Sentinel-2[3]?

[1] https://earthengine.google.com/

[2] https://cloud.google.com/storage/docs/public-datasets/landsa...

[3] https://cloud.google.com/storage/docs/public-datasets/sentin...

In case you don't scroll all the way down, there's a list of articles and videos titled "Use cases" at the bottom of the page which appear to cover (at least) how some of this data has been used.

I wish all technology announcements included that and put it up front. I'm often surprised at the number of interesting-sounding things I follow from HN's front page only to end up with no idea of what they're good for or why I might want to invest time in learning about them.

Your most enthusiastic customers can sometimes be the people who didn't know what was possible until your product came along.

> and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day.

They're really serious about counting, aren't they.

Dodge, duck, dip, dive and dodge!

I'd like to see the National Snow and Ice Data Center's data (soil moisture, sea ice cover/concentration, snow cover [looks like MODIS is already available], permafrost, glacier outlines) on AWS.

I know there are people there that want to see it happen, but it's a matter of cost. What incentives/programs does Earth on AWS offer to assist stewards of public data to make it available on AWS?

Additionally, I think some of this data is normally behind URS/Earthdata Login, what did the politics of making the data available on AWS without URS look like?

Exactly a year ago, at a conference, I had the chance to speak with a rep from Amazon who at the time was working to make the Landsat 8 data available on AWS. From what I remember, AWS covers the hosting cost of the datasets in exchange for being able to incentivize the use of AWS in working with them. The data storage and transfer costs, as well as the logistics, are enormous. I don't recall how the transfer was being managed, but it was certainly described more or less as a partnership.

NOAA is working on making this happen through the big data project: http://www.noaa.gov/big-data-project

On the NOAA side, there tend not to be loginwalls so that hasn't been much of a concern.

I work with one of the partners on this project, so if you have specific datasets or use case ideas feel free to drop me an email at zflamig uchicago.edu.

Exploring these datasets can be quite addictive, especially with a service like http://apps.sentinel-hub.com/sentinel-playground/

Would love to see Amazon make an integration for Unreal Engine 4 or their Lumberyard video game engine so a game developer can easily import detailed swaths of the earth.

This already exists with Unity via Mapbox - https://www.mapbox.com/unity/

Very cool, will need to check this out!

What I'm psyched about is OpenStreetMap data queryable with Athena. It's traditionally kind of a pain to convert PBFs to a queryable format.

Have you looked at Overpass API?

(it provides direct access to OSM data using a DSL: http://wiki.openstreetmap.org/wiki/Overpass_API/Overpass_QL )

For tiny purposes the public servers are sufficient and there seem to be quite a few people running private servers.
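As a small illustration of the Overpass QL DSL mentioned above, here is a sketch that builds a query for nodes with a given tag inside a bounding box and turns it into a request URL. The endpoint is one of the public instances and is assumed to still be reachable; the tag and bounding box are arbitrary examples. Nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# A public Overpass instance; assumed to be available, swap in your own server if needed.
OVERPASS_ENDPOINT = "https://overpass-api.de/api/interpreter"


def build_query(key, value, south, west, north, east, timeout=25):
    """Build an Overpass QL query for nodes tagged key=value inside a bbox."""
    bbox = f"{south},{west},{north},{east}"  # Overpass order: south, west, north, east
    return (
        f"[out:json][timeout:{timeout}];"
        f'node["{key}"="{value}"]({bbox});'
        "out body;"
    )


def build_url(query):
    """Return a GET URL that passes the query in the 'data' parameter."""
    return OVERPASS_ENDPOINT + "?" + urlencode({"data": query})


# Example: cafes in a small, arbitrary box over central London.
query = build_query("amenity", "cafe", 51.50, -0.13, 51.52, -0.10)
url = build_url(query)
```

The public servers are rate limited, so anything beyond small queries is better run against a private instance, as the comment above suggests.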

In case you missed it, we just added support for 2D Geospatial Queries in Amazon Athena:


PS: I'm on the Athena team

Out of pure curiosity, how so? I deal with Protobuf regularly, and as long as a decent library exists to dump to JSON that is domain specific to your use case it is trivial. Is that the only thing missing here?

Global OSM is 40 GB or so. There are various libraries to translate it but, as you can imagine, the sheer size of the dataset causes challenges. You also have to make choices about how you translate the attributes: for example, whether you want to pull certain tags from the key:value field into separate columns in a table. Yet another issue involves source and target geometries: there can be inconsistencies in how features of the same type are recorded in OSM in terms of geometry, so getting disparate input types translated into a single output type involves choices. Yes, you can easily (after a wait!) get global OSM translated into something else, but making that something else exactly what you need can take effort.
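The "pull certain tags into separate columns" choice can be sketched as follows. The element layout (a dict with `id`, `type` and `tags`) and the list of promoted keys are assumptions made for this example, not the schema of any particular converter.

```python
def flatten_tags(element, promoted_keys):
    """Turn one OSM element into a flat row.

    Tags listed in promoted_keys become their own columns (None if absent);
    everything else stays in a catch-all 'other_tags' mapping.
    """
    tags = element.get("tags", {})
    row = {"id": element["id"], "type": element["type"]}
    for key in promoted_keys:
        row[key] = tags.get(key)
    row["other_tags"] = {k: v for k, v in tags.items() if k not in promoted_keys}
    return row


# Hypothetical element, shaped like a decoded OSM node.
cafe = {
    "id": 1,
    "type": "node",
    "tags": {"amenity": "cafe", "name": "Example Cafe", "wheelchair": "yes"},
}
row = flatten_tags(cafe, promoted_keys=["amenity", "name", "highway"])
```

Which keys to promote is exactly the kind of decision that depends on what you want to query later.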

For starters, the OSM PBF file format is not a protobuf file! Instead it's a collection of protobuf files inside each other!

You can read more about the file format here: https://wiki.openstreetmap.org/wiki/PBF_Format

There are other problems, specific to OSM and not PBF/protobuf, like needing to store the locations of nodes until the end of file because they could be referenced anywhere in the file.
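The node-location problem can be sketched like this: node coordinates have to be kept around until every way that references them has been seen. The element dictionaries here are a simplified stand-in for decoded PBF records, and a plain in-memory dict is used only for illustration; at planet scale you would most likely need a more compact or disk-backed index.

```python
def resolve_ways(elements):
    """Attach coordinates to ways from a stream of node and way elements.

    Node locations are cached for the whole pass because a way may
    reference any node ID, regardless of where it appears in the stream.
    """
    node_locations = {}
    pending_ways = []
    for element in elements:
        if element["type"] == "node":
            node_locations[element["id"]] = (element["lat"], element["lon"])
        elif element["type"] == "way":
            pending_ways.append(element)

    resolved = {}
    for way in pending_ways:
        # Skip references to nodes missing from the input (e.g. a clipped extract).
        resolved[way["id"]] = [
            node_locations[ref] for ref in way["refs"] if ref in node_locations
        ]
    return resolved


# Hypothetical stream in which the way arrives before its nodes.
stream = [
    {"type": "way", "id": 100, "refs": [1, 2, 3]},
    {"type": "node", "id": 1, "lat": 10.0, "lon": 20.0},
    {"type": "node", "id": 2, "lat": 10.5, "lon": 20.5},
]
geometries = resolve_ways(stream)
```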

Wow what great timing! Just as we are scaling up our imagery DL projects, this is cool!

I wonder how this compares to Planet Labs dataset.

Are you referring to their Open California dataset?


It's larger as Open California only has the datasets from Landsat 8, Sentinel, and Planet's own satellites.

Wow...that’s a treasure trove of useful data all in one place. Major thanks to Amazon.

Here is another source https://earthexplorer.usgs.gov/. This one is the RAW data for many of the tilesets.

This makes me want to take this open source weather forecasting model and run it on AWS. http://planetwrf.com

People are doing this already. See https://depts.washington.edu/learnit/techconnect/cloudday/wo... for some good info on this.

> The planetWRF model is no longer available for downloading.


On a somewhat related topic - can anyone recommend a geocoder available through AWS?

There are several AWS marketplace solutions available on the link at the bottom of the original article.[1] Only Geolytica and Forward Geocoder seem to be available to new customers, and both have < 5 reviews.

[1] https://aws.amazon.com/mp/gis/#geocoding

Might not be what you are looking for but https://wiki.openstreetmap.org/wiki/Nominatim is a geocoder that runs on OSM data.

Pelias (and Mapzen Search) is so much better: http://pelias.io/

Thanks for sharing. Geocode and RevGeo are generally considered a Hard Problem (TM) in GIS so it is nice to see great projects such as this.


I'm one of the makers of the OpenCage Geocoder: https://geocoder.opencagedata.com

We provide a single, simple API that behind the scenes aggregates numerous open geocoders, including nominatim, DSTK, and others. Please give us a try, there is a free testing tier you can use as long as you like.

You might want to check out the Data Science Toolkit


Slightly tangential, but is there a "modern" alternative to GDAL for working with raster data?

The last time I tried, stitching together tiles and cutting them to state boundaries took an inordinate amount of time (upwards of 15 minutes for 6 tiles from Landsat-7/8). Though, I'm half convinced it was because I was doing something very suboptimal.

Also, iirc, it was single threaded.

No, GDAL is still the best. I also suspect you were doing something suboptimal. As far as modern wrappers for GDAL go, `rasterio` is the most pythonic. It is part of sgillies' suite, which includes shapely, rasterio, and fiona.

What are some interesting things to do with this?

In my lab there is a master's student working on monitoring deforestation for palm fields in Indonesia using Google Earth Engine, which is similar to Earth on AWS. There is a whole scientific field devoted to analysing this kind of data: remote sensing. It's underrated in the hacker community, honestly.

Geo as a whole in my observation. There are decades of research effort that have got us to where we are now, well developed study programmes worldwide, advanced proprietary and open software and data available, and geo is effectively mainstream in google maps and sat nav. Yet for some reason this contextual background is missed by many, and so I see commenters making statements about what a leap forward this or that is, when in fact it's just part of an evolving history. I'm not criticising people for not knowing what they don't know - my question is - why does it seem like geo as a whole has trouble communicating this context? I wonder whether there are any other areas of tech that suffer from this lack of awareness?

That sounds really interesting. Does your colleague blog — or are there remote sensing tech blogs to follow?

Various financial analyses, such as counting the number of cars in retailers' parking lots, looking for crop shortages for commodity traders, or estimating damage from natural disasters to gauge insurance companies' exposure.

I'm sure there are also more altruistic uses, such as providing better forecasting and advice for farmers in developing countries.

Disaster response is one of the altruistic cases. You can use deep learning to measure the impact of hurricanes and better allocate resources.

But this data can be months or years old, doesn't seem like this would have much value to financial analysts.

It depends on the revisit time, spatial resolution, region of interest, cloud coverage and product type. For example Landsat 8 images the entire Earth every 16 days, Sentinel-2 revisit time is ~5 days with 2 satellites and MODIS provides daily data but at moderate spatial resolution (> 250m). We expect both the spatial resolution and revisit time to improve as more companies are launching satellite constellations.
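To put the revisit times above into numbers, here is a back-of-the-envelope sketch of how many passes you get over a site in a given window, and how many of those you might expect to be usable. The 50% cloud fraction is a made-up assumption; real cloud cover varies a lot by region and season.

```python
# Revisit times in days, as quoted in the comment above.
REVISIT_DAYS = {
    "Landsat 8": 16,
    "Sentinel-2 (2 satellites)": 5,
    "MODIS": 1,
}


def approx_passes(window_days, revisit_days):
    """Approximate number of passes over one site within a time window."""
    return window_days / revisit_days


def expected_clear_passes(window_days, revisit_days, cloud_fraction):
    """Expected cloud-free passes, assuming each pass is cloudy independently."""
    return approx_passes(window_days, revisit_days) * (1.0 - cloud_fraction)


# Hypothetical scenario: a 30-day window with half of all passes clouded out.
for sensor, days in REVISIT_DAYS.items():
    clear = expected_clear_passes(30, days, cloud_fraction=0.5)
    print(f"{sensor}: about {clear:.1f} usable passes in 30 days")
```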

I did not appreciate that fact. Thanks

A friend of mine is using geospatial data to look for places in Mozambique with high probability of finding hominid fossils. He's basically automating Lee Rogers Berger's work of travelling Africa and looking for caves and other spots. It's pretty cool but still very preliminary.

Pardon if the question sounds dumb; will we have real time data of a certain region, for instance getting info about clouds?

Not real time, but the Landsat, Sentinel-2, MODIS and GOES data are all updated on a continuing basis.

The GOES data come from geostationary satellites pointed at the US and are updated a couple of times an hour:


GOES-16 imagery can now be as frequent as every 30 seconds over specific regions, every 5 minutes over CONUS, and every 15 minutes over the full disk.

You have real-time NEXRAD data, where the reflectivity is basically cloud cover. NEXRAD of course only covers the US.

I'm always interested in these kinds of data.

A few months ago, I was looking at different open sources to geocode a lot of addresses around the world.

I have tried openstreetmap and some VM from datasciencetoolkit - both have poor results.

Are there other sources aside from Google? Google appears to be the most accurate.

Check out https://openaddresses.io

It has ~477 million freely-licensed addresses.

Will the datasets be open to contribution from members of the public, or are these read-only mirrors? It seems like Blue Horizon and Prime Now, among many of their other offerings, would be use cases for up-to-the-minute data.

Looks like many of the datasets were obtained from federal organizations, in which case they should actually be in the public domain.

I am waiting for the day we can get satellite images that are so fine you can see people or animals.

You can get this now if you are military or have plenty of spare cash lying around. The problem is that to get this level of resolution your sensor needs to be nearer the earth, and thus your platform has a shorter lifespan because it will be subject to greater atmospheric drag, and thus its per-picture cost will be comparatively very high. This might prompt the question, why not put it farther away with a bigger lens? Well, there is an upper limit on the size/weight of the lens that you can lob up to any given orbit, and thus it's less feasible to get this level of resolution from a higher orbit. You also have the issue of swath width to think about - generally the higher your resolution the smaller your imaging area, which might limit the usefulness and thus the price you can charge for your imagery.

I think drone aerial imagery holds more promise than satellite imagery. Who knows though, perhaps with fancy new image processing algorithms and sensors we will get the level of resolution you are talking about from satellite imagery at reasonable cost over time.

Edit: or with bigger cheaper rockets.
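The altitude versus lens-size trade-off described above can be approximated with the Rayleigh diffraction criterion: the smallest resolvable ground feature is roughly 1.22 × wavelength × altitude / aperture diameter. This is only a theoretical lower bound that ignores atmosphere, sensor pixel size and motion blur, and the numbers below are illustrative assumptions rather than the specs of any real satellite.

```python
def diffraction_limited_resolution(wavelength_m, altitude_m, aperture_m):
    """Best-case ground resolution in meters under the Rayleigh criterion."""
    return 1.22 * wavelength_m * altitude_m / aperture_m


GREEN_LIGHT_M = 550e-9  # roughly the middle of the visible spectrum

# Hypothetical platforms: the same 1.1 m aperture at two different altitudes.
low_orbit = diffraction_limited_resolution(GREEN_LIGHT_M, 600e3, 1.1)
high_orbit = diffraction_limited_resolution(GREEN_LIGHT_M, 1200e3, 1.1)

# Aperture needed to match the low-orbit resolution from the higher orbit.
needed_aperture = 1.22 * GREEN_LIGHT_M * 1200e3 / low_orbit
```

Doubling the altitude doubles the best-case ground resolution figure, so you need twice the aperture to compensate, which is where launch mass and cost come in.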

This feels like another service Google can replicate and be much better at it considering their past experience.

They already provide the Earth Engine platform, which is more complete both in terms of available datasets and developer APIs.

The AWS product basically consists of open remote sensing datasets uploaded to S3. This is convenient if you deploy on AWS (transfer costs), but you still have to develop all the data processing yourself.

Google doesn’t do open geodata: they’re the only major internet company not to be invested in OSM in some way, preferring instead to build up their own proprietary geo database.

How do Bing (Microsoft) and Apple invest in OSM respectively? Also I am guessing you are not including Facebook? Or do they contribute as well?

Facebook is doing a lot of ML work on automated tracing from aerial imagery, which has great potential for mapping terra incognita in OSM. There are several bumps to be ironed out with the community but they're engaging well.

Bing has given OSM tracing rights over its aerial imagery for several years now - it's hard to overestimate how significant that is.

Apple (as you'd expect) has been much more private about its OSM work, but it has numerous people working on it and has been surfacing OSM data in Apple Maps in several parts of the world.

Facebook and Bing were both Gold sponsors of the most recent State of the Map US: https://2017.stateofthemap.us though in fairness Google was a bronze sponsor

Bing was a gold sponsor of the most recent State of the Map (global conference) in Japan, Facebook silver: http://2017.stateofthemap.org

Bing has for years allowed the use of their sat images for OSM tracing.

Bing has allowed OSMers to trace from their aerial imagery for years. Having proper aerial imagery is like night and day for mapping.

It used to be Yahoo aerial in the early days.

Sure, they can replicate much better and then shut it down later on.

Imagine if they'd never shut anything down.

People would be constantly popping up on HN repeating a different mantra about how bloated and unfocused they were (although arguably Google manages to be bloated and unfocused despite the shutdowns, but that's another debate).

Most of the Google shutdowns were understandable, whether you like them or not.

In any case - it's getting tedious to hear the same comment on every Google related post. They shut things down. We get it.

> People would be constantly popping up on HN repeating a different mantra about how bloated and unfocused they were

I don't see that for AWS.

All the services on AWS fulfill a need. Google often starts projects without direct profitability in their mind. Note that I'm talking about Google as a whole and not Google Cloud.

I'd say Google is a lot more liberal in starting new projects than any other company of its size. Look around in the news and you'll see how analysts comment about how Google has no 'direction' and is burning money.

I'm actually starting to find it quite amusing.

I mean the second they decide to license their Google Earth data including the cleanups they do on it, their offering will be unparalleled. There's no other org who has gone through this expensive process other than google.

Even if they decide to shut it down soon, the value companies and scientists get out of it will be worth it.

Presumably you are talking about the vector dataset? I think most of the raster imagery comes from commercial aerial imagery sources (certainly they have no monopoly whatsoever on that). I think there are other global vector datasets that are broadly comparable, no? Streetview excepted.

Maybe also worth pointing out that Google's record in geo isn't without its failures.

> the second they decide to license their Google Earth data including the cleanups they do on it

Haha. Good luck with that. Google doesn't give its geodata away.

We'd all be surprised, sure. But Google's biggest bet is cloud and I think they are willing to sacrifice a few things for the big win.

You mean like spend a billion dollars buying and building a satellite imagery company and then liquidate it to a competitor for equity in their company?

I know this will be cool if combined with machine learning but to do what? :)

lol is this a serious comment??? You have terabytes (petabytes) of data in front of you but can't think of a single thing to do with it????

Oh i know, we'll just 'machine learning' our 'big data' and get great business insights.

Look at trends in cars parked on streets.

how can you keep everyone safe if you can't see where they are, what they are doing or what is inside their head?
