
That's cool, but not what I would call high performance. If you do these often you would want an index, and then it should only take single-digit ms.



The reason I call it high performance is that it avoids the hours/days of processing (for the planet file)[1] that would be required to pull the data out of PBF and index it. And you'd also need RAM at least the size of the planet file to even get that level of speed.

You could certainly amortize this cost for repeated queries, but for one-off queries I haven't seen anything faster.

[1]: https://wiki.openstreetmap.org/wiki/Osm2pgsql/benchmarks
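
For reference, the kind of one-off query I mean looks roughly like this (a minimal sketch assuming DuckDB's spatial extension; the file name and the 'amenity' tag are placeholders, not from the original post):

    INSTALL spatial;
    LOAD spatial;

    -- Stream the PBF directly; no import or index step.
    -- (Older DuckDB versions return a list from map access, i.e. tags['amenity'][1].)
    SELECT count(DISTINCT tags['amenity']) AS distinct_amenities
    FROM st_readosm('planet-latest.osm.pbf')
    WHERE kind = 'node'
      AND tags['amenity'] IS NOT NULL;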


You wouldn't need hundreds of gigs of RAM to answer this query in 5ms. You'd need a few MB or so, once the initial indexing is done, of course.
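
Concretely, something like this (a sketch against a hypothetical table from an osm2pgsql-style import; names are illustrative):

    -- One-time cost: build the index.
    CREATE INDEX idx_nodes_amenity ON nodes (amenity);

    -- Afterwards this can be answered from a few MB of index pages
    -- rather than a scan of the whole table.
    SELECT count(DISTINCT amenity) FROM nodes WHERE amenity IS NOT NULL;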


How long do you think it would take to index it?


Depends on the machine :) hours maybe?


So for a one-off query DuckDB is tons faster, and easier.


Yeah, but it's not high performance, which was my original point.

I spend a lot of time optimizing stuff developers thought was "high performance," only to find they're scanning a 100GB+ dataset on every page load.


If the alternative takes hours, then ~30 seconds seems high performance to me.


DuckDB is indeed great for one-off stuff.

But calling 30s high performance for counting 60k unique values, as in this case, is misleading at best. I try not to mislead people. :)


I can help answer this a bit.

Processing the data with code and SQLite takes about 4 hours for the United States and Canada. If I can figure out how to parallelize some of the work, it could be faster.

The benefit of using SQLite is its low memory footprint: I can take this 40GB file (the uncompressed version) and serve it from a 256MB container instance, which shows how frugal SQLite is with resources. It does take up disk space, but I addressed that with the zstd compression (from the post).
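
To give a feel for it, SQLite's memory use can be capped well below the database size (a sketch; the numbers and the table/column names are illustrative, not from my setup):

    -- Cap the page cache at ~64 MB (a negative value means size in KiB).
    PRAGMA cache_size = -65536;
    -- Don't memory-map the 40GB file; read pages through the capped cache instead.
    PRAGMA mmap_size = 0;

    -- Queries then stay within the container's 256MB budget,
    -- paging index/table pages in from disk as needed.
    SELECT count(*) FROM nodes WHERE amenity = 'cafe';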

Was this more work, and should I have used Postgres and osm2pgsql? Yes. Was it fun? Well, yeah, because I got another feature out of it, one I have not written about yet and which I highly value.


I'm not near a computer to try this out, but I'd be surprised if you couldn't get a huge speed-up by selecting the whole file into a real table first and querying against that. DuckDB should be able to vectorize operations better then.
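
Something along these lines (a sketch; the file, table, and tag names are placeholders):

    -- Pay the PBF scan/decode cost once...
    CREATE TABLE osm AS
    SELECT kind, id, tags
    FROM st_readosm('planet-latest.osm.pbf');

    -- ...then repeated queries run against DuckDB's own columnar storage.
    SELECT count(DISTINCT tags['amenity'])
    FROM osm
    WHERE tags['amenity'] IS NOT NULL;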



