I recently discovered DuckDB's ST_ReadOSM() function [1], which lets you query OSM PBF files directly.
For example, it's simple to count the cafes in North America in under 30s:
SELECT COUNT(*)
FROM st_readOSM('/home/wcedmisten/Downloads/north-america-latest.osm.pbf')
WHERE tags['amenity'] = ['cafe'];
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│        57150 │
└──────────────┘
Run Time (s): real 24.643 user 379.067204 sys 3.696217
Unfortunately, I discovered there are still some bugs [2] that need to be ironed out, but it seems very promising for doing high performance queries with minimal effort.
Those are the most popular sources, and I've used both!
The first one is the official OpenStreetMap data, which contains the "planet file" - i.e. all the data for the entire world. But because OSM has so much stuff in it, the planet file is a whopping 76 GB, which can take a long time to process for most tasks. I also recommend using the torrent file for faster download speeds.
As a result of the planet's size, the German company Geofabrik provides unofficial "extracts" of the data, which are filtered down to a specific region, e.g. all the data for a particular continent, country, or U.S. state. If you click on the "Sub Region" link it will show countries, and if you click on those it will show states.
I was able to find all 10 Whole Foods in the City of Chicago in 22.6s with DuckDB. It's amazing! (there are tons more Whole Foods in the metro area, but it found the exact 10 in the city)
SELECT
    tags['addr:city'][1] AS city,
    tags['addr:state'][1] AS state,
    tags['brand'][1] AS brand,
    *
FROM st_readosm('us-latest.osm.pbf')
WHERE 1=1
  AND city = 'Chicago'
  AND state = 'IL'
  AND brand = 'Whole Foods Market';
I'm sure there are ways to make this faster (partitioning, indexing, COPY TO native format, etc.), but querying a 9.8 GB compressed raw-format file (with key-value fields stored as strings) covering the entire United States at this speed is pretty impressive to me.
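For what it's worth, the "COPY TO native format / partitioning" idea might look roughly like this (a sketch only; the output path and the choice of partition column are assumptions on my part):

-- Parse the PBF once and write it back out as Parquet, partitioned by entity kind,
-- so later queries skip the PBF decode entirely.
INSTALL spatial; LOAD spatial;

COPY (SELECT * FROM st_readosm('us-latest.osm.pbf'))
TO 'us_osm' (FORMAT PARQUET, PARTITION_BY (kind));

-- Subsequent ad-hoc queries hit the Parquet files instead of the 9.8 GB PBF.
SELECT tags['addr:city'][1] AS city, tags['brand'][1] AS brand
FROM read_parquet('us_osm/*/*.parquet')
WHERE tags['brand'] = ['Whole Foods Market'];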
That's great. I was impressed with DuckDB's speed with the OSM data. I'm also working with all of the data, but what I wanted to end up with was:
- a single file that can be queried
- a single file that can be easily compressed
When I index the entire United States, admittedly with a subset of tags, I can search in milliseconds. I'll take it.
The reason I call it high performance is that it avoids the hours/days of processing (for the planet file)[1] that would be required to pull the data out of PBF and index it. And you'd also need RAM at least the size of the planet file to even get that level of speed.
You could certainly amortize this cost for repeated queries, but for one-off queries I haven't seen anything faster.
You wouldn't need hundreds of gigs of RAM to answer this query in 5 ms. You'd need a few MB or so, after the initial indexing is done, of course.
Processing the data with code and SQLite takes about 4 hours for the United States and Canada. If I can figure out how to parallelize some of the work, it could be faster.
The benefit of using SQLite is its low memory footprint. I can have this 40GB file (the uncompressed version) and use a 256MB container instance. It shows how valuable SQLite is in terms of resources. It takes up disk space, but I followed up with the zstd compression (from the post).
Was this more work, and should I have used Postgres and osm2pgsql? Yes. Was it fun? Well, yeah, because I have another feature, which I have not written about yet, that I highly value.
Not near a computer to try this out, but I'd be surprised if you couldn't get a huge speed-up by selecting the whole file into a real table first and querying against that. DuckDB should be able to better vectorize operations then.
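Something like this, I'd guess (the table name and follow-up query are just for illustration):

INSTALL spatial; LOAD spatial;

-- materialize the PBF into a native DuckDB table once
CREATE TABLE osm AS
SELECT * FROM st_readosm('us-latest.osm.pbf');

-- repeated queries then run against DuckDB's native columnar storage
SELECT COUNT(*)
FROM osm
WHERE tags['amenity'] = ['cafe'];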
I liked the trick used here for speeding up tag key/value queries using an FTS index:
SELECT id
FROM entries e
JOIN search s ON s.rowid = e.id
WHERE
-- use FTS index to find subset of possible results
search MATCH 'amenity cafe'
-- use the subset to find exact matches
AND tags->>'amenity' = 'cafe';
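For anyone curious how the search side of that might be set up, here's a rough guess at the schema (the actual article may differ; tags_text and the flatten_tags() step are assumptions):

-- Contentless FTS5 table whose rowid mirrors entries.id
CREATE VIRTUAL TABLE search USING fts5(
    tags_text,           -- flattened "key value key value ..." text per entry
    content=''           -- contentless: only rowids come back from MATCH
);

-- Populate it from entries; flatten_tags(...) stands in for whatever produces the text
INSERT INTO search (rowid, tags_text)
SELECT id, flatten_tags(tags) FROM entries;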
If you convert those PBFs to Parquet files, you can then use DuckDB to search them with sub-second response times. Plus you get the added bonus of being able to host them in S3-style cloud storage.
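A rough sketch of the S3 part (the bucket and file names are made up, and it assumes the Parquet conversion has already been done, e.g. with a COPY like the one further up):

INSTALL httpfs; LOAD httpfs;
-- credentials/region would be configured via CREATE SECRET or the usual S3 env vars

SELECT COUNT(*)
FROM read_parquet('s3://my-osm-bucket/north-america.parquet')
WHERE tags['amenity'] = ['cafe'];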
I also found [flatgeobuf](https://github.com/flatgeobuf/flatgeobuf), which promises something similar-ish. Perhaps not the same interface for querying, but organizing data in an r-tree format for streaming from a blob store.
This is the business. I highly recommend this solution. Streaming queries via DuckDB or Polars will use little to no memory and be lightning fast for very large datasets.
So, hype aside, what's the over/under on DuckDB vs SQLite these days? I'm working on a Thing right now; it started with SQLite because it's a) good enough, b) stupendously optimized and hard as nails, and c) runs on your phone, your toaster, and your server.
What's DuckDB bringing to the table relative to sqlite, which seems like the boring-and-therefore-best choice?
And thinking about my specific project, it's clear that a great solution would allow me to pick row-vs-column storage on a per-table basis.
I basically have a graph over vectors, which is a very 'row-storage' kind of thing: I need to get a specific vector (a row of data), get all of its neighbors (rows in the edges table), and do some in-context computations to decide where to walk next on the graph.
However, we also have some data attached to vectors (covariates, tags, etc) which we often want to work with in a more aggregated way. These tables seem possibly more reasonable to approach with a columnar format.
1. For small individual queries, SQLite: think OLTP. Naive graph databases are basically KV stores, and native ones are more optimized here. DuckDB may also be OK here, but AFAICT it is columnar-oriented, and I'm unclear on its OLTP performance for row-oriented queries.
2. For bigger graph OLAP queries, where you might want to grab 1K, 100K, 1M, etc. multi-hop results, DuckDB may get more interesting. I'm not sure how optimized their left joins are, but that's basically the idea here.
FWIW, we've been building GFQL, which is basically Cypher graph queries on dataframes; it runs on pure vectorized pandas (CPU) and optionally cuDF (GPU). No DB needed: it runs in-process, columnar. We use it on anything from 100-edge graphs to 1 billion, and individual subgraph matches can be at that scale too; it just depends on your RAM. It solves low-throughput OLTP/dashboarding/etc. for smaller graphs by skipping a DB, and we are also using it for vector data, e.g., AI pipelines on 10M-edge similarity graphs where a single query combines weighted nearest-neighbor edge graph traversals with rich symbolic tabular node filters. It may be (much) faster than DuckDB on GPUs, lower weight if you are in Python, with DuckDB's benefits of the Arrow / columnar analytics ecosystem, and less gnarly looking / more maintainable if you are doing graph queries vs. wrangling them as tabular SQL. New and a lot on the roadmap, but fun :)
Knowing how little I know about your problem and constraints, I'd use SQLite for everything. Then if a specific part seems columnar, do some experiments in DuckDB and see what speedups or space savings you get, and then use both (see the sketch after this comment).
Sounds like you are more than one, so have some folks do some 4 hour spikes on DuckDB where you think it might be useful.
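One concrete shape "use both" can take (a sketch; the file and table names are made up): keep the row-oriented graph in SQLite, and attach it from DuckDB via its sqlite extension for the columnar/aggregate work.

INSTALL sqlite; LOAD sqlite;
ATTACH 'graph.db' AS graph (TYPE SQLITE);

-- covariates is a hypothetical table living in the SQLite file
SELECT tag, avg(weight)
FROM graph.covariates
GROUP BY tag;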
I'd use Rust as the top level, embedding SQLite, DuckDB, and Lua. In my application I used SQLite with FTS, and then many SQLite databases, none more than 2 GB in size, with rows containing binary blobs that didn't uncompress to more than 20 MB each.
> a great solution would allow me to pick row-vs-column storage on a per-table basis ... and do some in-context computations to decide where to walk to next on the graph
You may be interested in Datomic[0] or Datascript[1]
DuckDB only supports either one read+write client or arbitrary read clients with no writer. This is the biggest blocker for adopting it in a real-world hosted app.
No. Once a writer connects to a DuckDB database file, it does not allow any other.
SQLite always lets you connect additional reader (and writer) processes. You do not run into any locks until you execute write transactions concurrently.
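For reference, the "arbitrary read clients" mode on the DuckDB side looks roughly like this (file and table names assumed); every process has to open the database read-only:

ATTACH 'osm.duckdb' AS osm (READ_ONLY);
SELECT COUNT(*) FROM osm.entries;  -- entries is a hypothetical table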
If you are interested in optimizing the project further, I can suggest rebuilding SQLite with Profile-Guided Optimization (PGO). I have collected as many materials as possible (including many related benchmarks) in my repo: https://github.com/zamazan4ik/awesome-pgo . Regarding SQLite and PGO, I have the following link: https://sqlite.org/forum/forumpost/19870fae957d8c1a
Nope, I didn't test PGO explicitly on its FTS functionality. However, I am 99% sure that by enabling PGO for FTS you can expect ~5-10% performance win too.
Thanks. I looked through the notes you posted but... honestly, it seems a bit disorganised and rather complicated. Not sure the juice is worth the squeeze for a max 10% win (we are getting 2-3s query times on 15M records after sharding, and that seems good enough for now. We can speed this up by having more focused queries. Most of the time is spent on the bm25 ranking step. For anything less, like 200k records, it's already blazingly fast).
Whether a 10% performance win is worth it of course depends on your case. For some situations, even a 2x-3x performance win is not worth it at all; in other cases, even a few percent (or even half a percent) is a huge thing (especially at Google-like scale). If a different approach works fine for you, that's great, since you don't need to spend your time on PGO!
I do some sharding, mainly by grid-based bounding boxes. This allows searching within an area, or overlapping areas, using the FTS index. It means there is some data duplication, but it is minimal compared to the speed boost.
> CREATE INDEX entries_name ON entries(tags->>'name');
> However, this requires an index per tag, which won’t scale, especially for a dynamic list of tags.
That's not how indexes work at all. This will be fine.
My understanding is that this is how they work for a JSON/JSONB payload in SQLite. Postgres has [indexes](https://www.postgresql.org/docs/current/datatype-json.html#J...) for JSON, but SQLite doesn't support the same; you'd have to create individual indexes for each JSON key you need to query (a sketch is below).
If I'm missing a feature, I'd love to experiment more. Please let me know.
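A sketch of what those per-key expression indexes look like in SQLite (only two example keys shown):

CREATE INDEX entries_amenity ON entries(tags->>'amenity');
CREATE INDEX entries_brand   ON entries(tags->>'brand');

-- each index only helps queries whose filter uses the exact same expression
SELECT id FROM entries WHERE tags->>'amenity' = 'cafe';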
Wouldn’t creating an index for each tag cost a lot of disk space? There are hundreds (thousands?) of OSM tags. I like their approach of combining a FTS table with a normal JSON query for ad-hoc queries.
The FTS table also has the extra benefit of different tokenizers -- porter, unicode, trigram, etc. It allows less precise searches while still returning some results. I liked this because I wanted best-effort results, not exact matches, for some user queries.
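For example, a trigram-tokenized variant of the FTS table allows substring-style, less precise matching (a sketch; the column name is assumed):

CREATE VIRTUAL TABLE search_trigram USING fts5(
    tags_text,
    tokenize='trigram'   -- matches any 3+ character substring, so partial words hit too
);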
Some time ago, I had to extract a large amount of data from OSM and found the process harder than it should have been. Similar to how you shrunk the size by 50% by removing unnecessary tags, I ended up creating a Python package, 'earth-osm' (https://github.com/pypsa-meets-earth/earth-osm), that makes things more intuitive. I always wanted to push the data into a database but never got around to that....
The GPKG files https://github.com/marklit/osm_split produces are SQLite files under the hood and come with an R-Tree Spatial index so they'll load into QGIS and other GIS software quickly.
[1]: https://duckdb.org/docs/extensions/spatial.html#st_readosm--...
[2]: https://github.com/duckdb/duckdb_spatial/issues/349