How We Mapped 1.3M Data Points Using Mapbox (opennews.org)
105 points by danso on Sept 13, 2018 | 20 comments

Next time they'd better talk to experts, for their own sanity's sake!

> We set the interpolation process running in QGIS on a Mac Pro and, a mere 11 days later, found ourselves the proud owners of a raster 250m2 grid layer

This could probably have been done in minutes to hours with a properly set-up PostGIS database and one query. "Data journalists" seem to love it when things take a long time, thinking they are doing something super innovative and novel. Often it's just that they were using the wrong hammer...
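For a sense of scale: even plain NumPy/SciPy interpolates scattered points onto a regular 250 m grid in seconds to minutes, not days. A minimal sketch, with a made-up extent and made-up values standing in for the survey points:

```python
import numpy as np
from scipy.interpolate import griddata

rng = np.random.default_rng(0)

# Hypothetical stand-in for the survey points: (x, y) in metres plus a value.
n = 20_000  # scales to millions; Delaunay-based interpolation stays tractable
points = rng.uniform(0, 10_000, size=(n, 2))
values = np.sin(points[:, 0] / 1_000) + rng.normal(0, 0.1, n)

# Target grid at 250 m resolution over the same extent.
gx, gy = np.meshgrid(np.arange(0, 10_000, 250), np.arange(0, 10_000, 250))
grid = griddata(points, values, (gx, gy), method="linear")
print(grid.shape)  # → (40, 40)
```

Cells outside the convex hull of the input points come back as NaN with `method="linear"`, which is usually what you want for a map layer anyway.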

> 1. Don’t use GeoTIFFs for geometric visualization

What? I am not sure what exactly the image is supposed to show, but it looks like a rescaling/filtering artifact. Not the fault of GeoTIFF.

Why not render the raster tiles locally? gdal2tiles or TileMill can do that for you.

Why does Mapbox recommend LZW for compression? DEFLATE is usually much smaller, and with the horizontal differencing predictor it should be smaller still.
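The predictor's effect is easy to demonstrate outside of TIFF entirely. A toy sketch with zlib on a smooth synthetic scanline — horizontal differencing is what TIFF's PREDICTOR=2 does per row, and on smooth data (like a DEM) the differenced stream compresses much better:

```python
import zlib
import numpy as np

# A smooth "raster row": neighbouring pixels are similar, as in most DEMs.
row = (np.sin(np.linspace(0, 20, 65_536)) * 1000 + 2000).astype(np.uint16)
raw = row.tobytes()

# Horizontal differencing (per-scanline, with modular wraparound, like TIFF).
diff = np.diff(row, prepend=row[:1]).tobytes()

raw_size = len(zlib.compress(raw, 9))
diff_size = len(zlib.compress(diff, 9))
print(raw_size, diff_size)  # the differenced stream is noticeably smaller
```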

> 2. Your GeoJSON can probably be way smaller

Does Mapbox not support TopoJSON? Combined with quantisation, that helps a lot for local data.

> Because nothing in GIS is straightforward, attempting to round the values of 2.7m polygons using the field calculator in QGIS invariably resulted in the application crashing, even on a high-end Mac Pro. After several attempts using QGIS 2.16, 2.18 and 3.2,

Did they ask the community or even just file a bug report? QGIS should not crash like this.
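For what it's worth, rounding values across millions of features doesn't need a GUI at all. A minimal sketch in plain Python (the property name is invented) that rounds both coordinates and attribute values of a GeoJSON feature; the same idea streams fine over millions of features:

```python
import json

def round_coords(obj, ndigits=5):
    """Recursively round every float in a GeoJSON coordinate structure."""
    if isinstance(obj, float):
        return round(obj, ndigits)
    if isinstance(obj, list):
        return [round_coords(v, ndigits) for v in obj]
    return obj

feature = {
    "type": "Feature",
    "properties": {"value": 3.14159265},  # invented attribute field
    "geometry": {"type": "Point", "coordinates": [13.4050123456, 52.5200987654]},
}
feature["geometry"]["coordinates"] = round_coords(feature["geometry"]["coordinates"])
feature["properties"]["value"] = round(feature["properties"]["value"], 2)
print(json.dumps(feature))
```

Fewer decimal places means shorter JSON, which directly shrinks the GeoJSON the previous point complains about (five decimals is roughly 1 m precision at the equator).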

Yeah, as someone who does this sort of thing for a living, much of that was quite painful to read. I mean, hats off to them for getting a nice result, but they could have done it so much more efficiently by using the right tools.

While I don't disagree, this isn't something they do on a daily basis like you, so they're going in with blinders on.

Unfortunately, mapping documentation is absolutely abysmal. 90% of what's available online is either eight years out of date, riddled with TODOs, or a mish-mash of incompatible versions.

As someone who doesn't do this for a living, but has to build 70,000 maps on a weekly basis, I know it's a nightmare. Especially when you're going in for the first time.

I think they did a pretty good job considering they have to be jacks of all trades.

My OP was very negative, sorry.

Absolutely! Solving what you set out to do is a great result, and I am super happy for them and their end result!

But this state of the documentation (as you call it; I would also, or rather, count all the needed background knowledge and lingo as a hurdle) is all the more reason to simply write down their goal, their data, and their abilities and capabilities, and then ask an expert for 15 minutes of their time for input and pointers. That is so much more efficient!

Yeah, I guess I was being a bit harsh. They delivered a nice-looking result, and at the end of the day that's what counts. And honestly, thinking back on the first several times I did something similar, I can't really say I did a more efficient job.

I'd be interested in mapping solutions for python. What is your preferred workflow, HN?

Depends on the kind of data. Check out cartopy, descartes, rasterio, geopandas.

Another general point is that once you're working with sufficiently large data sets, you no longer have a GIS problem but a Big Data/High-Performance Computing problem, and you need to start working with the tools available in those domains together with the GIS tools.

Like in the article: if you have a problem that takes 11 days to run in QGIS, then you shouldn't be using QGIS, but a tool that is designed for processing large amounts of data. 1-2 million points might be a lot from a GIS-tools perspective, but it is absolutely nothing from a big data/HPC perspective, so be sure to check what those guys are doing.

So far I've been able to avoid that by only looking at chunks of the datasets. Can you give me some keywords for what to look for, or point me in the generally right direction? (I dread the day my supervisor wants to look at all the data at once.)

First rule. PostGIS is your friend. The PostGIS people have done a lot of work with making processing large sets of GIS data fast and easy, so always start there. If for some reason that doesn't work...

As a general rule, always ask yourself: what am I actually trying to do, mathematically? Then ask how someone would go about solving that math problem if you didn't tell them it was a GIS problem.

Much of raster analysis, for example, is just a combination of matrix math and convolutions. So look up how the numerical analysis people do matrix math and convolution on huge matrices. In Python, for example, you have tools like numexpr for fast elementwise transformation of matrices, and via numpy/scipy you can call BLAS and LAPACK. If you're dealing with rasters that don't fit in memory, take a look at solutions like Dask.
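Concretely, a focal mean ("moving window" statistic) over a raster is just a separable convolution. A minimal sketch with scipy.ndimage on an in-memory array standing in for a raster band:

```python
import numpy as np
from scipy.ndimage import uniform_filter

# A fake 4000x4000 float32 raster band (~61 MB in memory).
raster = np.random.default_rng(1).random((4000, 4000), dtype=np.float32)

# 5x5 focal mean -- the same operation a GIS "focal statistics" tool runs,
# but executed as a separable convolution in optimized C.
smoothed = uniform_filter(raster, size=5)
print(smoothed.shape, smoothed.dtype)
```

This runs in well under a second on a laptop; the equivalent per-cell loop in an interpreted expression engine is orders of magnitude slower.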

The same principle applies when dealing with vector data. Many problems are 'just' graph theory, so find out which libraries the graph theory people use to solve that sort of problem on large graphs and use those instead. Or if a problem reduces to a line/polygon intersection problem, well, that's just raycasting, and the games industry has spent a lot of effort making that really, really fast.
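The raycasting reduction really is that small. A toy even/odd-rule point-in-polygon sketch — real libraries like GEOS/Shapely handle degenerate edges and add spatial indexes on top of this:

```python
def point_in_polygon(x, y, poly):
    """Even/odd rule: count crossings of a ray cast rightwards from (x, y)."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Does this edge straddle the ray's y, and does the ray cross it?
        if (y1 > y) != (y2 > y):
            if x < x1 + (y - y1) * (x2 - x1) / (y2 - y1):
                inside = not inside
    return inside

square = [(0, 0), (10, 0), (10, 10), (0, 10)]
print(point_in_polygon(5, 5, square), point_in_polygon(15, 5, square))  # → True False
```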

Finally, learn to use the underlying libraries from the command line and scripts rather than via the GUI and how to divide that work across several processors/machines. GDAL + GNU parallel from the command line will transform 1000 rasters faster than QGIS could ever hope to do.

I'm surprised they didn't consider hosting their own vector tile server, instead of painstakingly rasterizing their entire dataset and optimizing it over and over to fit into the Mapbox data caps.

My company recently hit the need for maps showing several of our own proprietary layers at various zoom levels, and considering how much of the data is already in GeoJSON, converting that to the Mapbox Vector Tile binary format (MVT) was a breeze to implement on our own servers, compared to rasterizing all of the layers and re-rendering whenever any part of our dataset changes.
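The tile addressing underneath XYZ/MVT tilesets is simple Web Mercator math, which is part of why rolling your own is so approachable. A minimal sketch of the standard lon/lat-to-tile-coordinate conversion:

```python
import math

def lonlat_to_tile(lon, lat, zoom):
    """Convert WGS84 lon/lat to XYZ tile coordinates (the slippy-map scheme)."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

# Central London at zoom 12:
print(lonlat_to_tile(-0.1276, 51.5072, 12))  # → (2046, 1362)
```

Each feature gets clipped to the tiles it touches, coordinates are re-expressed in tile-local integer space, and the result is protobuf-encoded per the MVT spec — a static directory of such tiles can sit behind any web server or S3.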

I'm surprised about this as well. Even if they didn't want the overhead of running their own tile server, one of the best parts of the MVT format in my mind is that one-off tilesets like this layer can trivially be thrown into S3 and served extremely cheaply as well. That removes a ton of the effort they put into circumventing MapBox's limitations and is cheaper to host.

And presumably be cached through a service worker, which might be very useful.

It's interesting seeing the process they went through to work around MapBox's hosting limitations, but the map doesn't feel terribly usable to me. The chart doesn't update on pan/zoom (at least for me in Firefox) and the raster area is a little funky, I wish it was using a standard interpolation that covered everywhere instead of selective areas.

Also, I am loading a ton of data on this page. Interactive maps are nice and I love them but it's totally not usable on a mobile network for a ton of people (ironically given the content of the map). Very curious why they are using React at all, seems like extra page bloat for just one interactive map + chart. Do they test usability on 3G networks? I did not have great load times when off wifi. I think this would be better on a stand-alone page so you can read the article without having your load time impacted by the extra JavaScript.

To be fair, it’s a journalistic exercise not a long running webapp project, if the page loads a second later but development time is significantly reduced, it might be worth the tradeoff. The map works well for me, using mobile on wifi.

>it's totally not usable on a mobile network for a ton of people

As noted in the sections:

4. Maps aren’t truly responsive by default (but are annoying on mobile by default)

5. Performance, particularly on mobile, will need all the help it can get

As someone who worked for me used to say: "It's better than good. It's done."

Yes, it's a little funny that the top-line emphasis is on Mapbox when all the actually useful work was done by other tools, to get around Mapbox API limitations.

Anyone else have experience with GDAL? God, I hate that software so much, but it does some cool stuff. I've been working with GDAL/QGIS on the regular for about 5 years now, and it's pretty amazing what you can do. Everything looks great in a map.

Yes, ask me anything and I will try to help you.

Can you try to categorize what you hate about it? That is, if it's specific things and not just the general hurdle of figuring out where everything goes and what everything means.
