Optimizing Large-Scale OpenStreetMap Data with SQLite (jtarchie.com)
173 points by thunderbong 1 day ago | 34 comments





I recently discovered DuckDB's ST_ReadOSM() function [1], which lets you query OSM PBF files directly.

For example, it's simple to count the cafes in North America in under 30s:

  SELECT COUNT(*) FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf') WHERE tags['amenity'] = ['cafe'];
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        57150 │
  └──────────────┘
  Run Time (s): real 24.643 user 379.067204 sys 3.696217

Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.

[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...

[2]: https://github.com/duckdb/duckdb_spatial/issues/349
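For anyone trying this out: ST_ReadOSM ships with DuckDB's spatial extension, so it has to be installed and loaded first. A minimal setup sketch (the file path is a placeholder):

    INSTALL spatial;
    LOAD spatial;

    -- Inspect the columns the table function exposes before querying.
    DESCRIBE SELECT *
    FROM st_readosm('north-america-latest.osm.pbf');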


Fascinating use of DuckDB!

Can I ask where you get official OSM PBF data from? (I found these two links, but not sure what data these contain)

https://planet.openstreetmap.org/pbf/

http://download.geofabrik.de/


Those are the most popular sources, and I've used both!

The first one is the official OpenStreetMap data, which contains the "planet file" - i.e. all the data for the entire world. But because OSM has so much stuff in it, the planet file is a whopping 76 GB, which can take a long time to process for most tasks. I also recommend using the torrent file for faster download speeds.

Because of the planet file's size, the German company Geofabrik provides unofficial "extracts" of the data, which are filtered down to a specific region, e.g. all the data in a particular continent, country, or U.S. state. If you click on the "Sub Region" link it will show countries, and if you click on those it will show states.


I was able to find all 10 Whole Foods in the City of Chicago in 22.6s with DuckDB. It's amazing! (there are tons more Whole Foods in the metro area, but it found the exact 10 in the city)

        SELECT
            tags['addr:city'][1] AS city,
            tags['addr:state'][1] AS state,
            tags['brand'][1] AS brand,
            *
        FROM st_readosm('us-latest.osm.pbf')
        WHERE 1=1
            AND city = 'Chicago'
            AND state = 'IL'
            AND brand = 'Whole Foods Market'
I'm sure there are ways to make this faster (partitioning, indexing, COPY TO native format, etc.), but querying a 9.8 GB compressed raw-format file, with the data in key-value fields stored as strings, for the entire United States at this speed is pretty impressive to me.

So Geofabrik is the same data, but regionalized. This sounds like what I need. I already use DuckDB, so this is great.

I appreciate your taking the time to share this tidbit! It's a game changer in what I do (geospatial).


That's cool, but not what I would call high performance. If you do these often you would want an index, and then it should only take single-digit ms.

The reason I call it high performance is that it avoids the hours/days of processing (for the planet file)[1] that would be required to pull the data out of the PBF and index it. And you'd also need RAM at least the size of the planet file to even get that level of speed.

You could certainly amortize this cost for repeated queries, but for one-off queries I haven't seen anything faster.

[1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks


You wouldn't need hundreds of gigs of RAM to answer this query in 5 ms. A few MB or so would do, after the initial indexing is done of course.

How long do you think it would take to index it?

Depends on the machine :) hours maybe?

So for a one-off query DuckDB is tons faster, and easier.

Not near a computer to try this out, but I'd be surprised if you couldn't get a huge speedup by selecting the whole file into a real table first and querying against that. DuckDB should be able to vectorize operations better then.
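Untested sketch of that idea (table and file names are placeholders): materialize the PBF into a native DuckDB table once, then run the repeated queries against it.

    -- One-time load into a native (columnar) DuckDB table.
    CREATE TABLE osm AS
    SELECT * FROM st_readosm('us-latest.osm.pbf');

    -- Repeated queries scan the native table instead of re-decoding the PBF.
    SELECT count(*)
    FROM osm
    WHERE tags['brand'] = ['Whole Foods Market'];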

I liked the trick used here for speeding up tag key/value queries using an FTS index:

    SELECT id
    FROM entries e
    JOIN search s ON s.rowid = e.id
    WHERE
    -- use FTS index to find subset of possible results
    search MATCH 'amenity cafe'
    -- use the subset to find exact matches
    AND tags->>'amenity' = 'cafe';
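For context, here's a sketch of how such an FTS index can be built in SQLite. The schema below is my assumption (a contentless FTS5 table fed with flattened "key value" tag text), not necessarily the article's exact setup:

    -- Contentless FTS5 table holding each entry's tags as searchable text.
    CREATE VIRTUAL TABLE search USING fts5(tags, content='');

    -- Flatten each entry's JSON tags into space-separated "key value" tokens.
    INSERT INTO search (rowid, tags)
    SELECT e.id,
           (SELECT group_concat(j.key || ' ' || j.value, ' ')
            FROM json_each(e.tags) AS j)
    FROM entries e;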

So, hype aside, what's the over/under on DuckDB vs SQLite these days? I'm working on a Thing right now and started with SQLite due to it being a) good enough, b) stupendously optimized and hard as nails, and c) able to run on your phone, your toaster, and your server.

What's DuckDB bringing to the table relative to sqlite, which seems like the boring-and-therefore-best choice?


Ha, looks like the DuckDB devs have a great and balanced answer for this: https://marclamberti.com/blog/duckdb-getting-started-for-beg...

If you want row storage, use sqlite. If you want columnar storage, use DuckDB.


And thinking about my specific project, it's clear that a great solution would allow me to pick row-vs-column storage on a per-table basis.

I basically have a graph over vectors, which is a very 'row-storage' kind of thing: I need to get a specific vector (a row of data), get all of its neighbors (rows in the edges table), and do some in-context computations to decide where to walk to next on the graph.

However, we also have some data attached to vectors (covariates, tags, etc) which we often want to work with in a more aggregated way. These tables seem possibly more reasonable to approach with a columnar format.


My intuition is:

1. For small individual queries, SQLite: think OLTP. Naive graph databases are basically KV stores, and native ones are more optimized here. DuckDB may also be OK here, but AFAICT it is columnar-oriented, and I'm unclear on OLTP perf for row-oriented queries.

2. For bigger graph OLAP queries, where you might want to grab 1K, 100K, 1M, etc. multi-hop results, DuckDB may get more interesting. I'm not sure how optimized its left joins are, but that's basically the idea here.

FWIW, we have been building GFQL, which is basically Cypher graph queries on dataframes; it runs on pandas (CPU) and optionally cuDF (GPU). No DB needed, it runs in-process. We use it on anything from 100-edge graphs to 1 billion, and individual subgraph matches can be at that scale too; it just depends on your RAM. It solves low-throughput OLTP/dashboarding/etc. for smaller graphs by skipping a DB, and we are also using it for vector data, e.g. 10M-edge similarity graphs where we switch between weighted nearest-neighbor edge traversals and rich symbolic node filters. It may be (much) faster than DuckDB on GPUs, lower weight if you are in Python, keeps DuckDB's benefits of the Arrow / columnar analytics ecosystem, and is less gnarly-looking / more maintainable if you are doing graph queries vs. wrangling them as tabular SQL. New and a lot on the roadmap, but fun :)


Knowing how little I know about your problem and constraints, I'd use SQLite for everything. Then if a specific part seems columnar, do some experiments in DuckDB and see what speedups or space savings you get and then use both.

Sounds like you are more than one person, so have some folks do some 4-hour spikes on DuckDB where you think it might be useful.

I'd use Rust as the top level, embedding SQLite, DuckDB, and Lua. In my application I used SQLite with FTS, plus many SQLite databases no more than 2 GB in size, with rows containing binary blobs that didn't uncompress to more than 20 megs each.
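If "use both" ends up being the answer, one low-friction way to mix them is DuckDB's sqlite extension, which can attach a SQLite file and join it against native columnar tables. A sketch (file and table names are made up; syntax per recent DuckDB versions):

    INSTALL sqlite;
    LOAD sqlite;

    -- Keep row-oriented data in SQLite and attach it from DuckDB.
    ATTACH 'graph.db' AS rowdb (TYPE sqlite);

    -- Columnar analytics live in native DuckDB tables and can join across.
    SELECT n.id, avg(c.value) AS mean_covariate
    FROM rowdb.nodes n
    JOIN covariates c ON c.node_id = n.id
    GROUP BY n.id;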


> a great solution would allow me to pick row-vs-column storage on a per-table basis ... and do some in-context computations to decide where to walk to next on the graph

You may be interested in Datomic[0] or Datascript[1]

[0]: https://www.datomic.com/benefits.html

[1]: https://github.com/tonsky/datascript


Other (maybe more) important points: indexed vs. non-indexed access, and heavy batch updates vs. smaller updates and inserts.

DuckDB only supports either one read+write client or arbitrary read clients with no writer. This is the biggest blocker for adopting it in a real-world hosted app.

Rel: https://duckdb.org/docs/connect/concurrency.html#handling-co...


That's the same limitation as sqlite, actually.

> This highlights the importance of iterative refinement and the power of combining different technologies to solve problems.

This uninformative non-sentence sounds an awful lot like ChatGPT.


If you are interested in optimizing the project further, I can suggest rebuilding SQLite with Profile-Guided Optimization (PGO). I have collected as many materials as possible (including many related benchmarks) in my repo: https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and PGO, I have the following link: https://sqlite.org/forum/forumpost/19870fae957d8c1a

Have you tested PGO for FTS in SQLite? It wasn’t clear from all those links. Would be very interested in that.

Nope, I didn't test PGO explicitly on its FTS functionality. However, I am 99% sure that by enabling PGO for FTS you can expect a ~5-10% performance win too.

> CREATE INDEX entries_name ON entries(tags->>'name');

> However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags.

That's not how indexes work at all. This will be fine.


Wouldn’t creating an index for each tag cost a lot of disk space? There are hundreds (thousands?) of OSM tags. I like their approach of combining an FTS table with a normal JSON query for ad-hoc queries.
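One middle ground, if only a handful of tags ever get filtered on: a partial expression index in SQLite only stores the rows that actually carry the tag, so the per-tag disk cost tracks tag usage rather than table size. A sketch (the index name is made up):

    -- Only rows that actually have an amenity tag end up in this index.
    CREATE INDEX entries_amenity
    ON entries(tags->>'amenity')
    WHERE tags->>'amenity' IS NOT NULL;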

Some time ago, I had to extract a large amount of data from OSM and found the process harder than it should have been (similar to how you shrank the size by 50% after removing unnecessary tags). I ended up creating a Python package, 'earth-osm' (https://github.com/pypsa-meets-earth/earth-osm), that makes things more intuitive. I always wanted to push the data into a database but never got around to that...

The GPKG files that https://github.com/marklit/osm_split produces are SQLite files under the hood and come with an R-Tree spatial index, so they'll load into QGIS and other GIS software quickly.

I went through a similar process when I converted wikipedia xml dumps into sqlite.

https://news.ycombinator.com/item?id=28013347


If you convert those PBFs to Parquet files, you can then use DuckDB to search them with sub-second response times. Plus, you get the added bonus of being able to host them in S3-type cloud storage.

https://duckdb.org/docs/extensions/httpfs/s3api
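Rough sketch of that workflow (bucket and file names are placeholders, and S3 credentials still need to be configured per the httpfs docs):

    INSTALL spatial; LOAD spatial;
    INSTALL httpfs;  LOAD httpfs;

    -- One-time conversion: materialize the PBF as Parquet.
    COPY (SELECT * FROM st_readosm('us-latest.osm.pbf'))
    TO 'us-latest.parquet' (FORMAT parquet);

    -- Later queries can read the Parquet file locally or straight from S3.
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/us-latest.parquet')
    WHERE tags['amenity'] = ['cafe'];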


This is the business. I highly recommend this solution. Streaming queries via DuckDB or Polars will use little to no memory and be lightning fast for very large datasets.

This is brilliant


