Optimizing Large-Scale OpenStreetMap Data with SQLite (jtarchie.com)



I recently discovered DuckDB's ST_ReadOSM() function [1], which lets you query OSM PBF files directly.

For example, it's simple to count the cafes in North America in under 30s:

  SELECT COUNT(*)
  FROM st_readosm('/home/wcedmisten/Downloads/north-america-latest.osm.pbf')
  WHERE tags['amenity'] = ['cafe'];
  ┌──────────────┐
  │ count_star() │
  │    int64     │
  ├──────────────┤
  │        57150 │
  └──────────────┘
  Run Time (s): real 24.643 user 379.067204 sys 3.696217

Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.

[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...

[2]: https://github.com/duckdb/duckdb_spatial/issues/349


Fascinating use of DuckDB!

Can I ask where you get official OSM PBF data from? (I found these two links, but not sure what data these contain)

https://planet.openstreetmap.org/pbf/

http://download.geofabrik.de/


Those are the most popular sources, and I've used both!

The first one is the official OpenStreetMap data, which contains the "planet file" - i.e. all the data for the entire world. But because OSM has so much stuff in it, the planet file is a whopping 76 GB, which can take a long time to process for most tasks. I also recommend using the torrent file for faster download speeds.

As a result of the planet's size, the German company Geofabrik provides unofficial "extracts" of the data, which are filtered down to a specific region. E.g. all the data in a particular continent, country, or U.S. state. If you click on the "Sub Region" link it will show countries, and if you click on those it will show states.
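
For example, one of those state-level extracts can be queried directly the same way as the continent file above. A quick sketch, using the file name as Geofabrik serves it for Illinois:

    -- Count cafes in a single-state extract instead of a whole continent.
    SELECT count(*)
    FROM st_readosm('illinois-latest.osm.pbf')
    WHERE tags['amenity'] = ['cafe'];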


I was able to find all 10 Whole Foods in the City of Chicago in 22.6s with DuckDB. It's amazing! (there are tons more Whole Foods in the metro area, but it found the exact 10 in the city)

        SELECT 
            tags['addr:city'][1] city,
            tags['addr:state'][1] state,
            tags['brand'][1] brand,
            *, 
        FROM st_readosm('us-latest.osm.pbf')
        WHERE 1=1
        and city = 'Chicago'
        and state = 'IL'
        and brand = 'Whole Foods Market'
I'm sure there are ways to make this faster (partitioning, indexing, COPY TO native format, etc.) but querying a 9.8GB compressed raw format file with data (in key-value fields stored as strings) for the entire United States at this speed is pretty impressive to me.
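
For repeated queries, the "native format" option is easy to sketch: materialize the PBF into a DuckDB table once and query that instead (the us_osm table name is just illustrative):

    -- One-time load: parse the PBF once into a persistent DuckDB database.
    CREATE TABLE us_osm AS
    SELECT * FROM st_readosm('us-latest.osm.pbf');

    -- Later lookups scan columnar storage instead of re-decoding the PBF.
    SELECT tags['addr:city'][1] AS city, tags['brand'][1] AS brand
    FROM us_osm
    WHERE tags['brand'][1] = 'Whole Foods Market';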


That's great. I was impressed with DuckDB's speed with the OSM data. I'm doing it with all the data, but I intended to have:

- a single file that can be queried
- a single file that can be easily compressed

When I index the entire United States, admittedly with a subset of tags, I can search in milliseconds. I'll take it.


1-2 seconds in Overpass which also uses OSM:

https://overpass-turbo.eu/s/1NTy


So Geofabrik is the same data, but regionalized. This sounds like what I need. I already use DuckDB, so this is great.

I appreciate your taking the time to share this tidbit! It's a game changer in what I do (geospatial).


That's cool, but not what I would call high performance. If you do these often you would want an index, and then the query should only take single-digit ms.


The reason I call it high performance is that it avoids the hours/days of processing (for the planet file)[1] that would be required for pulling the data out of the PBF and indexing it. And you'd also need RAM at least the size of the planet file to even get that level of speed.

You could certainly amortize this cost for repeated queries, but for one-off queries I haven't seen anything faster.

[1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks


You wouldn't need hundreds of gigs of RAM to answer this query in 5ms. You'd need a few MB or so, after the initial indexing is done of course.


How long do you think it would take to index it?


Depends on the machine :) hours maybe?


So for a one-off query DuckDB is tons faster, and easier.


yeah, but it's not high performance, which was my original point.

I spend a lot of time optimizing stuff developers thought was "high performance" and they're scanning a 100gb+ dataset on every page load.


If the alternative takes hours, then ~30 seconds seems high performance to me.


DuckDB is indeed great for one off stuff.

But calling 30s high performance for quantifying 60k unique values like in this case is misleading at best. I try not to mislead people. :)


I can help answer this a bit.

Processing the data with code and SQLite takes about 4 hours for the United States and Canada. If I can figure out how to parallelize some of the work, it could be faster.

The benefit of using SQLite is its low memory footprint. I can have this 40GB file (the uncompressed version) and use a 256MB container instance. It shows how valuable SQLite is in terms of resources. It takes up disk space, but I followed up with the zstd compression (from the post).

Was this more work, and should I have used Postgres and osm2pgsql? Yes. Was it fun? Well, yeah, because I also get another feature, which I have not written about yet and which I highly value.


Not near a computer to try this out, but I'd be surprised if you couldn't get a huge speed-up by selecting the whole file into a real table first and querying against that. DuckDB should be able to better vectorize operations then.


I did try DuckDB at first. Not for tags, but for the relationships between the nodes, ways, and relations. It was slow. Really fast to import, however!


I liked the trick used here for speeding up tag key/value queries using a FTS index:

    SELECT id
    FROM entries e
    JOIN search s ON s.rowid = e.id
    WHERE
    -- use FTS index to find subset of possible results
    search MATCH 'amenity cafe'
    -- use the subset to find exact matches
    AND tags->>'amenity' = 'cafe';
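
For context, here is a minimal sketch of how such an FTS index could be wired up in SQLite; the schema below (an entries table with a JSON tags column, and an FTS5 search table of "key value" text) is an assumption, not necessarily the post's exact layout:

    -- Hypothetical schema: tags stored as a JSON object per OSM element.
    CREATE TABLE entries (
        id   INTEGER PRIMARY KEY,
        tags TEXT
    );

    -- FTS5 table holding the flattened "key value" pairs for each entry.
    CREATE VIRTUAL TABLE search USING fts5(tag_text);

    -- Populate the index so MATCH 'amenity cafe' narrows candidates
    -- before the exact tags->>'amenity' = 'cafe' check.
    INSERT INTO search(rowid, tag_text)
    SELECT id,
           (SELECT group_concat(key || ' ' || value, ' ') FROM json_each(tags))
    FROM entries;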


At first, I didn't do it, but I had that shower thought of "What If?"

It worked! It was super helpful in finding more accurate results.


If you convert those PBFs to Parquet files, you can then use DuckDB to search them with sub-second response times (rough sketch below). Plus you get the added bonus of being able to host them in S3-type cloud storage.

https://duckdb.org/docs/extensions/httpfs/s3api
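
A rough sketch of that flow, with the bucket and file names as placeholders:

    -- One-time conversion: parse the PBF and write Parquet.
    INSTALL spatial; LOAD spatial;
    COPY (SELECT * FROM st_readosm('us-latest.osm.pbf'))
        TO 'osm_us.parquet' (FORMAT parquet);

    -- Then query the Parquet in place, locally or straight from S3
    -- (assumes S3 credentials are already configured).
    INSTALL httpfs; LOAD httpfs;
    SELECT count(*)
    FROM read_parquet('s3://my-bucket/osm_us.parquet')
    WHERE tags['amenity'] = ['cafe'];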


That's an interesting take.

I also found [flatgeobuf](https://github.com/flatgeobuf/flatgeobuf), which promises something similar-ish. Perhaps not the same interface for querying, but organizing data in an r-tree format for streaming from a blob store.


This is the business. I highly recommend this solution. Streaming queries via DuckDB or Polars will use little to no memory and be lightning fast for very large datasets.


So, hype aside, what's the over/under on DuckDB vs SQLite these days? I'm working on a Thing right now and started with SQLite due to it being a) good enough, b) stupendously optimized and hard as nails, and c) able to run on your phone, your toaster, and your server.

What's DuckDB bringing to the table relative to sqlite, which seems like the boring-and-therefore-best choice?


Ha, looks like the DuckDB devs have a great and balanced answer for this: https://marclamberti.com/blog/duckdb-getting-started-for-beg...

If you want row storage, use sqlite. If you want columnar storage, use DuckDB.


And thinking about my specific project, it's clear that a great solution would allow me to pick row-vs-column storage on a per-table basis.

I basically have a graph over vectors, which is a very 'row-storage' kind of thing: I need to get a specific vector (a row of data), get all of its neighbors (rows in the edges table), and do some in-context computations to decide where to walk to next on the graph.

However, we also have some data attached to vectors (covariates, tags, etc) which we often want to work with in a more aggregated way. These tables seem possibly more reasonable to approach with a columnar format.


My intuition is:

1. For small individual queries, SQLite: think OLTP. Naive graph databases are basically KV stores, and native ones are more optimized here. DuckDB may also be OK here, but AFAICT it is columnar-oriented, and I'm unclear on OLTP performance for row-oriented queries.

2. For bigger graph OLAP queries, where you might want to grab 1K, 100K, 1M, etc. multi-hop results, DuckDB may get more interesting. I'm not sure how optimized their left joins are, but that's basically the idea here.

FWIW, we have been building GFQL, which is basically Cypher graph queries on dataframes; it runs on pure vectorized pandas (CPU) and optionally cuDF (GPU). No DB needed, runs in-process, columnar. We use it on anything from 100-edge graphs to 1 billion, and individual subgraph matches can be at that scale too; it just depends on your RAM. It solves low-throughput OLTP/dashboarding/etc. for smaller graphs by skipping a DB, and we are also using it for vector data, e.g., AI pipelines on 10M-edge similarity graphs where a single query combines weighted nearest-neighbor edge graph traversals and rich symbolic tabular node filters. It may be (much) faster than DuckDB on GPUs, lower weight if you are in Python, with DuckDB's benefits of the Arrow / columnar analytics ecosystem, and less gnarly looking / more maintainable if you are doing graph queries vs. wrangling them as tabular SQL. New and a lot on the roadmap, but fun :)


Knowing how little I know about your problem and constraints, I'd use SQLite for everything. Then if a specific part seems columnar, do some experiments in DuckDB and see what speedups or space savings you get and then use both.

Sounds like you are more than one, so have some folks do some 4 hour spikes on DuckDB where you think it might be useful.

I'd use Rust as the top level, embedding SQLite, DuckDB and Lua. In my application I used SQLite with FTS and then many SQLite databases not more than 2GB in size with rows containing binary blobs that didn't uncompress to more than 20 megs each.


> a great solution would allow me to pick row-vs-column storage on a per-table basis ... and do some in-context computations to decide where to walk to next on the graph

You may be interested in Datomic[0] or Datascript[1]

[0]: https://www.datomic.com/benefits.html

[1]: https://github.com/tonsky/datascript


Other (maybe more) important points: indexes vs. no indexes, and heavy batch updates vs. smaller updates and inserts.


DuckDB only supports either one read+write client or arbitrary read clients with no writer. This is the biggest blocker for adopting it in a real-world hosted app.

Rel: https://duckdb.org/docs/connect/concurrency.html#handling-co...


That's the same limitation as sqlite, actually.


No. Once a writer connects to a DuckDB database file, it does not allow any other.

SQLite always lets you connect additional reader (and writer) processes. You do not run into any locks until you execute write transactions concurrently.
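
One related note: SQLite's WAL journal mode is usually what makes the "many readers plus one writer" pattern comfortable in practice. It's a per-database setting:

    -- Write-ahead logging: readers are not blocked while a writer commits.
    PRAGMA journal_mode = WAL;
    -- Optional: wait up to 5 seconds for a lock instead of failing immediately.
    PRAGMA busy_timeout = 5000;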


If you are interested in optimizing the project further, I can suggest rebuilding SQLite with Profile-Guided Optimization (PGO). I have collected as many materials as possible (including many related benchmarks) in my repo: https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and PGO, I have the following link: https://sqlite.org/forum/forumpost/19870fae957d8c1a


I have not tried that.

Will PGO work with CGO external libraries? I'm using https://github.com/mattn/go-sqlite3


Have you tested PGO for FTS in SQLite? It wasn’t clear from all those links. Would be very interested in that.


Nope, I didn't test PGO explicitly on its FTS functionality. However, I am 99% sure that by enabling PGO for FTS you can expect a ~5-10% performance win too.


Thanks. I looked through the notes you posted but... honestly, it seems a bit disorganised and rather complicated. Not sure the juice is worth the squeeze for a max 10% gain (we are getting 2-3s query time on 15m records after sharding, and that seems good enough for now. We can speed this up by having more focused queries. Most of the time is spent on the bm25 ranking step. For anything less, like 200k records, it's already blazingly fast).


Yep. Information is still a bit messy to navigate. I think the most "preprocessed" PGO source for now is this still-unfinished article: https://github.com/zamazan4ik/awesome-pgo/blob/main/article/...

Whether a 10% performance win is worth it of course depends on your case. For some situations, even a 2x-3x performance win is not worth it at all; in other cases, even a few percent (or even half a percent) is a huge thing (especially at Google-like scale). If other approaches work fine for you, great: you don't need to spend your time on PGO!


I do some sharding, mainly by grid-based bounding boxes. This lets a search use the FTS index for a single area or for several overlapping areas. It means there is some data duplication, but it is minimal compared to the speed boost.
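
Roughly what that could look like, assuming the entries keep a lon/lat per element and a build with SQLite's math functions (the 1-degree cell size is purely illustrative):

    -- Hypothetical sharding: bucket elements into 1-degree grid cells; each
    -- cell (or group of neighbouring cells) gets its own file and FTS index.
    SELECT
        CAST(floor(lon) AS INTEGER) AS cell_x,
        CAST(floor(lat) AS INTEGER) AS cell_y,
        count(*)                    AS elements
    FROM entries
    GROUP BY cell_x, cell_y;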


> This highlights the importance of iterative refinement and the power of combining different technologies to solve problems.

This uninformative non-sentence sounds an awful lot like ChatGPT.


Grammarly, but point taken. Will fix.


> CREATE INDEX entries_name ON entries(tags->>'name');

> However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags.

That's not how indexes work at all. This will be fine.


I understand this is how they would work for a JSON/JSONB payload in SQLite. In Postgres, there are [indexes](https://www.postgresql.org/docs/current/datatype-json.html#J...) for JSONB, but SQLite doesn't support the same. You'd have to create individual indexes for the keys from the JSON payload that you need.
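
To make that concrete, here's roughly what the per-key expression indexes look like in SQLite (the key names below are just examples):

    -- One expression index per tag key you expect to filter on.
    CREATE INDEX entries_amenity ON entries(tags->>'amenity');
    CREATE INDEX entries_brand   ON entries(tags->>'brand');

    -- A query can only use the index whose expression it matches exactly.
    SELECT id FROM entries WHERE tags->>'amenity' = 'cafe';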

If I'm missing a feature, I'd love to experiment more. Please let me know.


Wouldn’t creating an index for each tag cost a lot of disk space? There are hundreds (thousands?) of OSM tags. I like their approach of combining a FTS table with a normal JSON query for ad-hoc queries.


The FTS table also has the extra benefit of different tokenizers -- porter, unicode, trigram, etc. It allows less precise searches to happen but still returns some results. I liked this because I wanted the best effort, not exact results for some user queries.
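
For example, the same kind of FTS5 table can be declared with different tokenizers (table names here are illustrative; the trigram tokenizer needs SQLite 3.34+):

    -- porter stems words, so 'cafes' still matches 'cafe'.
    CREATE VIRTUAL TABLE search_porter  USING fts5(tag_text, tokenize = 'porter');
    -- trigram allows substring-style matches at the cost of a bigger index.
    CREATE VIRTUAL TABLE search_trigram USING fts5(tag_text, tokenize = 'trigram');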


Some time ago, I had to extract a large amount of data from OSM and found the process harder than it should have been. Similar to how you shrunk the size by 50% after removing unnecessary tags. I ended up creating a Python package, 'earth-osm' (https://github.com/pypsa-meets-earth/earth-osm), that makes things more intuitive. I always wanted to push the data into a database but never got around to it...


I went through a similar process when I converted wikipedia xml dumps into sqlite.

https://news.ycombinator.com/item?id=28013347


This is almost the same. We both do compression. The compression layer is transparent to the application via a VFS.

I have tried this trick with Wikipedia too. I don't have the numbers. I'll push up the code when I can.


The GPKG files https://github.com/marklit/osm_split produces are SQLite files under the hood and come with an R-Tree Spatial index so they'll load into QGIS and other GIS software quickly.
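
For anyone poking at those files outside of a GIS, a GeoPackage spatial index is itself just an SQLite R-tree virtual table named rtree_<table>_<geom_column>, so a bounding-box query looks roughly like this (the layer and column names are illustrative):

    -- Bounding-box filter via the R-tree, then join back to the feature table.
    SELECT f.*
    FROM my_layer f
    JOIN rtree_my_layer_geom r ON r.id = f.fid
    WHERE r.minx <= -87.52 AND r.maxx >= -87.94   -- overlaps the bbox in x
      AND r.miny <=  42.02 AND r.maxy >=  41.64;  -- overlaps the bbox in y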


This is brilliant


Oh, you!





