As a meta-topic, the standard formats for geospatial data are virtually all deeply suboptimal, to the point of being arguably broken for modern large-scale geospatial analysis. What we have is a giant collection of narrow solutions to narrow problems, none of which interoperate efficiently with each other. It hasn't modernized much at all; the tool chain is basically the same tech we had 25 years ago with a fresh coat of paint.
Fixing this is a difficult and extremely deep engineering and computer science problem, which is why there has been so little progress. Most improvements have been driven by users, but they aren't able to make the investments to fix the problems underneath their problem; they have other work to do. Any fundamental improvements would materially break existing workflows, so adoption would be slow at best.
The entire spatial domain is stuck in a deep local minimum.
I share the same feelings, hence I, along with OP, made this attempt at creating a library based on whatever seemed most performant and closest to the regular data engineering world:
1. GeoParquet for metadata (an Iceberg catalog could be added later if the scale is extremely large)
2. Cloud Optimized GeoTIFF (COG) for image data (which is what NASA and ESA have been pumping out for the last 5 years and still do)
3. an efficient, lightweight library that can quickly grab pieces of data from 100s or 1000s of raster files in parallel (a rough sketch of the pattern is below).
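To make the three pieces concrete, here is a minimal Python sketch of the pattern, not the library's actual API: the file name, the "asset_href" column, and the bounding box are all assumptions, and for simplicity it pretends the COGs share the bbox's CRS (real code would reproject).

    from concurrent.futures import ThreadPoolExecutor

    import geopandas as gpd
    import rasterio
    from rasterio.windows import from_bounds
    from shapely.geometry import box

    bbox = (77.5, 12.9, 77.7, 13.1)  # hypothetical lon/lat area of interest

    # 1. GeoParquet as the metadata layer: columnar, so filtering scenes is cheap.
    scenes = gpd.read_parquet("scenes.parquet")          # hypothetical file
    scenes = scenes[scenes.intersects(box(*bbox))]

    # 2 + 3. COGs as the data layer: read only the window we need from each file,
    # fanned out across threads so 100s of rasters can be hit concurrently.
    def read_window(href):
        with rasterio.open(href) as src:
            window = from_bounds(*bbox, transform=src.transform)
            return src.read(1, window=window)

    with ThreadPoolExecutor(max_workers=16) as pool:
        chunks = list(pool.map(read_window, scenes["asset_href"]))  # hypothetical column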
I did not want to attempt creating another format, because I felt that we as a geo community are at a stage where such tools and files put together form a pretty good first-level foundation to build on. No need for heavy legacy tools; it's closer to regular data (with Parquet) and closer to the DuckDB world.
I feel adoption may be good if these are the thesis / guiding principles.
Baby steps surely, but good steps I believe, and this is all thanks to the open source community.
I wonder how this compares to the use of the Zarr format, which seems to have similar goals and design and is already well integrated into several libraries (particularly xarray)?
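For reference, this is the kind of thing I mean by xarray integration; a minimal Zarr read, with a made-up store URL and variable name. Zarr chunks play a role similar to COG internal tiles in that only the slices you ask for get fetched.

    import xarray as xr

    # Hypothetical consolidated Zarr store on object storage.
    ds = xr.open_zarr("s3://example-bucket/cube.zarr", consolidated=True)

    # Lazy selection: only the chunks overlapping this window are downloaded.
    subset = ds["red"].sel(x=slice(500_000, 510_000), y=slice(1_440_000, 1_430_000))
    data = subset.values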
Hi, Sid here from Terrafloww. Essentially, this blog and library's attempt is based on the fact that COG has emerged as the standard format used by NASA, ESA, and others over the past 5 years. There is way more data in COG, and it is still continually being produced as COGs. So we wanted to make sure that we use the most prevalent format as efficiently and quickly as possible. We didn't try to create a new format; I believe that's best left to the large and open community itself to decide.
As someone who deals with geospatial processes like this daily, I have 2 notes.
1. STAC implementations are already complicated. Not everyone has good catalogues, or ones that work uniformly well for all types of queries.
2. Using STAC geoparquet on top, and then another layer on top of that, would mean we have to self-host yet another catalog, essentially a new standard.
In short, even though I believe STAC adoption is what we should aim for, in reality we usually end up building workarounds.
Totally agree. When I worked at Radiant Earth the most frequent question we got was "I have this data I want to share, how do I create a good STAC for it?"
It was a totally valid question but one that's practically impossible to answer. As a result there's just so much variation between STACs and you never really know what you're going to get.
Hi, Sid here, one of the authors of the blog. STAC being open and extensible makes it a double-edged sword, yes. Quite refreshing to hear this from someone who was at Radiant, because it shows we still haven't reached a great way of sharing data yet.
What do you think of attempts like Source Cooperative?
There's clearly a need for Source Cooperative given the overwhelming positive feedback we received during the beta. However, Source Cooperative is entirely dependent on Microsoft and Amazon subsidizing all of the S3 / Azure Blob Storage costs. They could pull the rug out at any moment, like we've seen with Planetary Computer, and Source Cooperative would no longer be sustainable.
Disclaimer: I built Source Cooperative and left Radiant 2 months ago.
Agree on both fronts!
STAC is pretty complex. My attempt here was to make raw data access easy and fast, not to solve STAC, which I believe stac-geoparquet basically attempts to fix (it makes STAC columnar and hence faster to query at scale).
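To illustrate the "columnar and faster to query" point, here's a rough DuckDB sketch against a stac-geoparquet file; the file name and exact column names are assumptions, not our library's API.

    import duckdb

    con = duckdb.connect()
    rows = con.execute("""
        SELECT id, datetime, "eo:cloud_cover"
        FROM read_parquet('sentinel2_items.parquet')  -- hypothetical stac-geoparquet file
        WHERE datetime BETWEEN '2024-01-01' AND '2024-06-30'
          AND "eo:cloud_cover" < 20
        ORDER BY datetime
    """).fetchall()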
And yes, having a Parquet file will add the overhead of needing some form of catalog. But I believe we are very close to having Iceberg with native geo types be that catalog. At the same time, it opens another can of worms (Databricks and other catalogs, etc.).
The silver lining is that Parquet (GeoParquet) makes geo data closer to regular data.
Sure! Glad you shared your honest opinion. But I just want to reiterate that the blog, and the library that will be released alongside it, is not an attempt to create a new standard. All throughout the blog it's been clearly stated that this is "a new approach", not a "new standard", a "new format", or even a "better standard".
Thanks! I will be posting more on this, especially as we try out more use cases like ML dataset curation, globally distributed time series analysis, and more. I will surely take your feedback about varying AOI sizes into account. Here's the Terrafloww LinkedIn; you can find me on it too: https://www.linkedin.com/company/terrafloww-inc/