Looking at the comments, most people don't understand what this is.
In the geospatial industry, there are many organizations that produce free open data.
For example, the NAIP imagery comes from the USDA and has been paid for by the US government so that cities and states can use it for agriculture - hence why the images are not just RGB, but also include a near-infrared band so they can feed agricultural analyses like NDVI. For that particular dataset the license is very liberal. In case you are curious about that particular dataset, you can find more info here: https://www.fsa.usda.gov/programs-and-services/aerial-photog...
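If you're wondering what NDVI actually is: it's just a band ratio, (NIR - Red) / (NIR + Red). A minimal sketch with rasterio and numpy, assuming a 4-band NAIP GeoTIFF where band 1 is red and band 4 is near-infrared - check the band order of your particular product, and the file name is just a placeholder:

```python
# Minimal NDVI sketch for a 4-band NAIP GeoTIFF.
# Assumes band 1 = red, band 4 = NIR; verify against your product's band order.
import numpy as np
import rasterio

with rasterio.open("naip_tile.tif") as src:  # placeholder file name
    red = src.read(1).astype("float32")
    nir = src.read(4).astype("float32")

# NDVI = (NIR - Red) / (NIR + Red), in [-1, 1]; higher values indicate denser vegetation.
denom = nir + red
ndvi = np.divide(nir - red, denom, out=np.zeros_like(denom), where=denom != 0)
print(float(ndvi.min()), float(ndvi.max()))
```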
The problem with dealing with datasets of this size is that merely collecting and storing them is a resource problem in itself. This AWS page is saying that they have gathered all these datasets from various governments and non-profits and are hosting them in raw form so you can use them. Because the data comes from so many different institutions, the licenses differ - but practically speaking they are all very liberal.
It is not competing with any previous commercial service from any vendor, nor is it meant to be a solution of any kind... just big public spatial datasets hosted on AWS.
I want to give another example: I'm currently updating the Terrain Tiles dataset [0], and a very significant portion of my time has been spent working around terrible government data download websites or incompatible distribution formats.
For example, the UK Government flew some excellent LIDAR missions generating a very high resolution elevation model for most of the country, and then put it behind a terrible website where you have to click 3 or 4 times to get a small piece of the data. After a couple of hours I built a script to download all the pieces and put them back together into usable-sized downloads [1]. Mexico's INEGI has a similar situation, so I had to dig through that to build a scraper [2]. USGS's EarthExplorer uses a terrible shopping-cart metaphor for downloads [3].
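For anyone stuck doing the same thing: once the pieces are finally on disk, reassembling them is the easy part. A rough sketch with the GDAL Python bindings - the paths and creation options here are just illustrative and depend on the source data:

```python
# Rough sketch: mosaic a directory of downloaded elevation tiles into one GeoTIFF.
# File paths are placeholders; creation options depend on the source data.
import glob
from osgeo import gdal

tiles = sorted(glob.glob("downloads/*.tif"))

# A VRT is a lightweight virtual mosaic; no pixels are copied until you need them.
vrt = gdal.BuildVRT("mosaic.vrt", tiles)
vrt = None  # flush the VRT to disk

# Materialize the mosaic (compression keeps the "usable sized download" manageable).
gdal.Translate("mosaic.tif", "mosaic.vrt",
               creationOptions=["COMPRESS=DEFLATE", "TILED=YES"])
```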
All that is to say that the interesting piece with Earth on AWS is that this is public data that smart people are putting in a more easily accessible place for mass consumption and AWS is footing the bill. In return AWS is getting people more interested in AWS products and a set of customers that are more knowledgeable about how to process data "in the cloud".
Would you mind if I stuck your collections of data in the Internet Archive? I appreciate AWS' efforts in this regard, but trust the Internet Archive for access and persistence more (and the Archive serves every object as a torrent).
It'd be great to have this data in Internet Archive. If you check out our GitHub repo [0] you can see where we coordinate finding data sources. The data I download gets composited into tiles for display on maps, so we're updating ~4 billion objects in S3.
The source data is probably the better thing to include in IA, and the GitHub repo is probably the best place to find how to mirror it. If you've got time to spend on it, you might post an issue in there and I can help point you in the right direction.
Unrelated to AWS, but sites like you describe are what we used in "How not to do It" while justifying the (very minor) expense in building the Alaska Elevation Archive. http://elevation.alaska.gov/
The National Map, from the USGS, is another example of "ugh", at least it was when it was tile-by-tile download only.
If you had ever compiled some of these datasets, you would realize that it was impossible to get them without extremely complicated scrapers that generated jobs, waited for a tiny little portion to be extracted (like 0.1% of it), and downloaded the result from an FTP location with non-standardized names... or you would end up just giving up, calling people on the phone until the person in charge eventually returned your call, and buying pre-filled hard drives from them that would get mailed to you eventually.
I realize that some of these sources have small budgets and others have million-dollar budgets, yet none have solved cheap storage and distribution the way Amazon has. To each their own; Amazon is able to help in its own way.
My comment was more a warning not to expect this to be any easier to manipulate than the original data.
In case you don't scroll all the way down, there's a list of articles and videos titled "Use cases" at the bottom of the page which appears to cover (at least some of) how this data has been used.
I wish all technology announcements included that and put it up front. I'm often surprised at the number of interesting-sounding things I follow from HN's front page only to end up with no idea of what they're good for or why I might want to invest time in learning about them.
Your most enthusiastic customers can sometimes be the people who didn't know what was possible until your product came along.
> and identifies the people, locations, organizations, counts, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day.
They're really serious about counting, aren't they.
I'd like to see the National Snow and Ice Data Center's data (soil moisture, sea ice cover/concentration, snow cover [looks like MODIS is already available], permafrost, glacier outlines) on AWS.
I know there are people there that want to see it happen, but it's a matter of cost. What incentives/programs does Earth on AWS offer to assist stewards of public data to make it available on AWS?
Additionally, I think some of this data is normally behind URS/Earthdata Login, what did the politics of making the data available on AWS without URS look like?
I had the chance to speak with a rep from Amazon who at the time was working to make the Landsat 8 data available on AWS, exactly a year ago at a conference. From what I remember, AWS covers the hosting cost of the datasets in exchange for being able to incentivize the use of AWS in working with them. The data storage and transfer costs, as well as the logistics, are enormous. I don't recall how the transfer was being managed, but it was certainly described more or less as a partnership.
On the NOAA side, there tend not to be loginwalls so that hasn't been much of a concern.
I work with one of the partners on this project, so if you have specific datasets or use case ideas feel free to drop me an email at zflamig uchicago.edu.
Would love to see Amazon make an integration for Unreal Engine 4 or their Lumberyard video game engine so a game developer can easily import detailed swaths of the earth.
Out of pure curiosity, how so? I deal with Protobuf regularly, and as long as a decent library exists to dump it to JSON in a way that is domain-specific to your use case, it is trivial. Is that the only thing missing here?
There are other problems, specific to OSM and not PBF/protobuf, like needing to store the locations of nodes until the end of file because they could be referenced anywhere in the file.
Global OSM is 40 GB or so - there are various libraries to translate it, but as you can imagine, the sheer size of the dataset causes challenges. You also have to make choices about how you translate the attributes - for example, if you want to pull certain tags from the key:value field into separate columns in a table. Yet another issue involves source and target geometries - there can be inconsistencies in how features of the same type are recorded in OSM in terms of geometry, and so getting disparate input types translated into a single output type involves choices. Yes, you can easily (after a wait!) get global OSM translated into something else, but making that something else exactly what you need can take effort.
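To make the "tags into columns" point concrete, here's a rough sketch with pyosmium that flattens a couple of way tags into CSV columns. The tag choices and file names are purely illustrative, and a planet-scale run needs far more care (memory, node location handling, output format choices, etc.):

```python
# Sketch: flatten selected OSM way tags into CSV columns using pyosmium.
# Tag choices and file names are illustrative; a full planet extract needs much more care.
import csv
import osmium

class HighwayHandler(osmium.SimpleHandler):
    def __init__(self, writer):
        super().__init__()
        self.writer = writer

    def way(self, w):
        highway = w.tags.get("highway")
        if highway:
            # Pull chosen key:value pairs out of the tag blob into fixed columns.
            self.writer.writerow([w.id, highway, w.tags.get("name", "")])

with open("highways.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["osm_id", "highway", "name"])
    HighwayHandler(writer).apply_file("extract.osm.pbf")
```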
On a somewhat related topic - can anyone recommend a geocoder available through AWS?
There are several AWS marketplace solutions available on the link at the bottom of the original article.[1] Only Geolytica and Forward Geocoder seem to be available to new customers, and both have < 5 reviews.
We provide a single, simple API that behind the scenes aggregates numerous open geocoders, including Nominatim, DSTK, and others. Please give us a try; there is a free testing tier you can use as long as you like.
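For reference, Nominatim (one of the open geocoders mentioned above) is also easy to query directly if you just want to experiment. A minimal sketch, keeping in mind the public instance's usage policy for anything beyond light testing:

```python
# Minimal example of querying the public Nominatim geocoder directly.
# Respect the Nominatim usage policy (identify yourself, keep request rates low).
import requests

resp = requests.get(
    "https://nominatim.openstreetmap.org/search",
    params={"q": "350 Fifth Avenue, New York", "format": "json", "limit": 1},
    headers={"User-Agent": "geocoding-example"},
)
resp.raise_for_status()
results = resp.json()
if results:
    print(results[0]["lat"], results[0]["lon"], results[0]["display_name"])
```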
Slightly tangential, but is there a "modern" alternative to GDAL for working with raster data?
The last time I tried, stitching together tiles and cutting them to state boundaries took an inordinate amount of time (upwards of 15 minutes for 6 tiles from Landsat 7/8). Though I'm half convinced it was because I was doing something very suboptimal...
No, GDAL is still the best. I also suspect you were doing something suboptimal. As far as modern wrappers for GDAL go, `rasterio` is the most pythonic. It's part of sgillies' suite, which includes shapely, rasterio, and fiona.
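For the mosaic-and-clip workflow above, a rough rasterio sketch. File names are placeholders, the boundary must be in the same CRS as the rasters, and for big Landsat mosaics working through VRTs or windowed reads is usually where the real time savings are:

```python
# Sketch: mosaic a handful of Landsat tiles and clip to a state boundary with rasterio.
# File names are placeholders; the boundary must share the rasters' CRS.
import fiona
import rasterio
from rasterio.merge import merge
from rasterio.mask import mask

sources = [rasterio.open(p) for p in ["tile1.tif", "tile2.tif", "tile3.tif"]]
mosaic, transform = merge(sources)

profile = sources[0].profile
profile.update(height=mosaic.shape[1], width=mosaic.shape[2], transform=transform)

with rasterio.open("mosaic.tif", "w", **profile) as dst:
    dst.write(mosaic)

# Clip the mosaic to the boundary polygon(s).
with fiona.open("state_boundary.shp") as shp:
    shapes = [feat["geometry"] for feat in shp]

with rasterio.open("mosaic.tif") as src:
    clipped, clip_transform = mask(src, shapes, crop=True)
```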
In my lab there is a masters student working on monitoring deforestation for palm fields in Indonesia using Google Earth Engine, which is similar to Earth on AWS. There is a whole scientific field devoted to analysing this kind of data: remote sensing. It's underrated in the hacker community honestly.
Geo as a whole, in my observation. There are decades of research effort that have got us to where we are now, well-developed study programmes worldwide, advanced proprietary and open software and data available, and geo is effectively mainstream in Google Maps and sat nav. Yet for some reason this contextual background is missed by many, and so I see commenters making statements about what a leap forward this or that is, when in fact it's just part of an evolving history. I'm not criticising people for not knowing what they don't know - my question is: why does it seem like geo as a whole has trouble communicating this context? I wonder whether there are any other areas of tech that suffer from this lack of awareness?
Various financial analyses, such as counting the number of cars in retailers' parking lots, looking for crop shortages on behalf of commodity traders, and estimating damage from natural disasters to gauge insurance companies' exposure.
I'm sure there are also more altruistic uses, such as providing better forecasting and advice for farmers in developing countries.
It depends on the revisit time, spatial resolution, region of interest, cloud coverage and product type. For example, Landsat 8 images the entire Earth every 16 days, Sentinel-2's revisit time is ~5 days with two satellites, and MODIS provides daily data but at moderate spatial resolution (>250 m).
We expect both the spatial resolution and revisit time to improve as more companies are launching satellite constellations.
A friend of mine is using geospatial data to look for places in Mozambique with high probability of finding hominid fossils. He's basically automating Lee Rogers Berger's work of travelling Africa and looking for caves and other spots. It's pretty cool but still very preliminary.
Will the datasets be open to contribution from members of the public, or are these read-only mirrors? It seems like Blue Horizon and Prime Now, amongst many of their other offerings, would be use cases for up-to-the-minute data.
You can get this now if you are military or have plenty of spare cash lying around. The problem is that to get this level of resolution your sensor needs to be nearer the earth, and thus your platform has a shorter lifespan because it will be subject to greater atmospheric drag, and thus its per-picture cost will be comparatively very high. This might prompt the question, why not put it farther away with a bigger lens? Well, there is an upper limit on the size/weight of the lens that you can lob up to any given orbit, and thus it's less feasible to get this level of resolution from a higher orbit. You also have the issue of swath width to think about - generally the higher your resolution the smaller your imaging area, which might limit the usefulness and thus the price you can charge for your imagery.
I think drone aerial imagery holds more promise than satellite imagery. Who knows though, perhaps with fancy new image processing algorithms and sensors we will get the level of resolution you are talking about from satellite imagery at reasonable cost over time.
They already provide the Earth Engine platform, which is more complete both in terms of available datasets and developer APIs.
The AWS product basically consists of open remote sensing datasets uploaded to S3. This is convenient if you deploy on AWS (transfer costs), but you still have to develop all the data processing yourself.
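To illustrate the "just datasets on S3" point, a minimal boto3 sketch pulling from the public Landsat 8 bucket; the bucket name and key layout reflect the landsat-pds documentation at the time and may have changed since:

```python
# Minimal sketch: list objects in the public Landsat 8 bucket on S3 and fetch one band.
# Bucket name / key layout follow the landsat-pds docs at the time and may have changed.
import boto3
from botocore import UNSIGNED
from botocore.config import Config

# Anonymous access is enough for a public bucket.
s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))

resp = s3.list_objects_v2(Bucket="landsat-pds", Prefix="L8/139/045/", MaxKeys=10)
for obj in resp.get("Contents", []):
    print(obj["Key"], obj["Size"])

# Download one band to work with locally (the key shown is illustrative):
# s3.download_file("landsat-pds", "L8/139/045/SCENE_ID/SCENE_ID_B4.TIF", "B4.TIF")
```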
Google doesn’t do open geodata: they’re the only major internet company not to be invested in OSM in some way, preferring instead to build up their own proprietary geo database.
Facebook is doing a lot of ML work on automated tracing from aerial imagery, which has great potential for mapping terra incognita in OSM. There are several bumps to be ironed out with the community but they're engaging well.
Bing has given OSM tracing rights over its aerial imagery for several years now - it's hard to overestimate how significant that is.
Apple (as you'd expect) has been much more private about its OSM work, but it has numerous people working on it and has been surfacing OSM data in Apple Maps in several parts of the world.
Facebook and Bing were both gold sponsors of the most recent State of the Map US (https://2017.stateofthemap.us), though in fairness Google was a bronze sponsor.
Bing was a gold sponsor of the most recent State of the Map (global conference) in Japan, Facebook silver: http://2017.stateofthemap.org
Bing has for years allowed the use of their sat images for OSM tracing.
People would be constantly popping up on HN repeating a different mantra about how bloated and unfocused they were (although arguably Google manages to be bloated and unfocused despite the shutdowns - but that's another debate).
Most of the Google shutdowns were understandable, whether you like them or not.
In any case - it's getting tedious to hear the same comment on every Google related post. They shut things down. We get it.
All the services on AWS fulfill a need. Google often starts projects without direct profitability in their mind. Note that I'm talking about Google as a whole and not Google Cloud.
I'd say Google is a lot more liberal in starting new projects than any other company of its size. Look around in the news and you'll see how analysts comment about how Google has no 'direction' and is burning money.
I mean, the second they decide to license their Google Earth data, including the cleanup they do on it, their offering will be unparalleled. No other org has gone through this expensive process other than Google.
Even if they decide to shut it down soon, the value companies and scientists get out of it will be worth it.
Presumably you are talking about the vector dataset? I think most of the raster imagery comes from commercial aerial imagery sources (certainly they have no monopoly whatsoever on that). I think there are other global vector datasets that are broadly comparable, no? Streetview excepted.
Maybe also worth pointing out that Google's record in geo isn't without its failures.
You mean like spend a billion dollars buying and building a satellite imagery company and then liquidate it to a competitor for equity in their company?