
How We Mapped 1.3M Data Points Using Mapbox - danso
https://source.opennews.org/articles/how-we-made-our-broadband-map-using-mapbox/
======
llao
Next time they'd better talk to experts, for their own sanity's sake...!

> We set the interpolation process running in QGIS on a Mac Pro and, a mere 11
> days later, found ourselves the proud owners of a raster 250m2 grid layer

Probably done in minutes to hours with a properly set up PostGIS database and
one query. "Data Journalists" seem to love it when things take a long time,
thinking they are doing something super innovative and novel. Often it's just
that they were using the wrong hammer...
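
Something along these lines, say (a very rough sketch, not what the article
actually did: an inverse-distance-weighted average onto a 250m grid done
inside PostGIS and driven from Python; the table/column names are made up, it
assumes a metric CRS like EPSG:27700, and ST_SquareGrid needs PostGIS >= 3.1):

    import psycopg2

    IDW_SQL = """
    WITH bounds AS (
        SELECT ST_Extent(geom)::geometry AS g FROM premises
    ),
    grid AS (
        SELECT (ST_SquareGrid(250, g)).geom AS cell FROM bounds
    )
    SELECT ST_Centroid(grid.cell) AS cell_centre,
           SUM(nn.speed_mbps / (nn.d * nn.d)) / SUM(1.0 / (nn.d * nn.d)) AS idw_speed
    FROM grid
    CROSS JOIN LATERAL (
        SELECT p.speed_mbps,
               GREATEST(ST_Distance(p.geom, ST_Centroid(grid.cell)), 1.0) AS d
        FROM premises p
        ORDER BY p.geom <-> ST_Centroid(grid.cell)  -- KNN, uses the spatial index
        LIMIT 12
    ) AS nn
    GROUP BY grid.cell;
    """

    with psycopg2.connect("dbname=broadband") as conn, conn.cursor() as cur:
        cur.execute(IDW_SQL)
        cells = cur.fetchall()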

> 1. Don’t use GeoTIFFs for geometric visualization

What? I'm not sure what exactly the image is supposed to show, but it looks
like some rescaling/filtering artefact. That's not the fault of GeoTIFF.

Why not render the raster tiles locally? gdal2tiles or Tilemill can do that
for you.
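
Rendering the tiles yourself is close to a one-liner; something like this
(paths and zoom range are made up):

    import subprocess

    # tile a GeoTIFF into a z/x/y pyramid with gdal2tiles
    subprocess.run(
        ["gdal2tiles.py", "--zoom=5-12", "--webviewer=none", "speeds.tif", "tiles/"],
        check=True,
    )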

Why does Mapbox recommend LZW for compression? DEFLATE is usually much
smaller, and with the horizontal differencing predictor it should be smaller
still.
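
Recompressing is trivial with the GDAL Python bindings, something like this
(file names are placeholders):

    from osgeo import gdal

    gdal.Translate(
        "speeds_deflate.tif",          # placeholder output
        "speeds_lzw.tif",              # placeholder input
        creationOptions=[
            "COMPRESS=DEFLATE",
            "PREDICTOR=2",             # horizontal differencing (use 3 for float data)
            "TILED=YES",
        ],
    )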

> 2. Your GeoJSON can probably be way smaller

Does Mapbox not support TopoJSON? That helps a lot for local data, especially
with quantisation.

> Because nothing in GIS is straightforward, attempting to round the values of
> 2.7m polygons using the field calculator in QGIS invariably resulted in the
> application crashing, even on a high-end Mac Pro. After several attempts
> using QGIS 2.16, 2.18 and 3.2,

Did they ask the community or even just file a bug report? QGIS shouldn't
crash like that; it sounds like a genuine bug.
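
FWIW, rounding 2.7m values is trivial once you step outside the GUI;
something like this with geopandas (path and column name are made up):

    import geopandas as gpd

    gdf = gpd.read_file("premises_polygons.gpkg")    # placeholder path
    gdf["speed_mbps"] = gdf["speed_mbps"].round(1)   # placeholder column
    gdf.to_file("premises_rounded.gpkg", driver="GPKG")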

~~~
dagw
Yeah, as someone who does this sort of thing for a living, much of that was
quite painful to read. I mean, hats off to them for getting a nice result,
but they could have done it so much more efficiently by using the right tools.

~~~
reaperducer
While I don't disagree, this isn't something they do on a daily basis like
you, so they're going in with blinders on.

Unfortunately, mapping documentation is absolutely abysmal. 90% of what's
available online is either eight years out of date, riddled with TODOs, or a
mish-mash of incompatible versions.

As someone who doesn't do this for a living, but has to build 70,000 maps on a
weekly basis, I know it's a nightmare, especially when you're going in for the
first time.

I think they did a pretty good job considering they have to be jacks of all
trades.

~~~
kaybe
I'd be interested in mapping solutions for Python. What is your preferred
workflow, HN?

~~~
llao
Depends on the kind of data. Check out cartopy, descartes, rasterio,
geopandas.
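
A minimal geopandas example, just to show the shape of that workflow (file
and column names are made up):

    import geopandas as gpd
    import matplotlib.pyplot as plt

    gdf = gpd.read_file("areas.gpkg")       # anything OGR can read
    gdf = gdf.to_crs(epsg=27700)            # reproject if needed
    gdf.plot(column="median_speed", cmap="viridis", legend=True)
    plt.show()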

~~~
dagw
Another general point is that once you're working with sufficiently large
data sets, you no longer have a GIS problem but a big data/high-performance
computing problem, and you need to start working with the tools available in
those domains together with the GIS tools.

Like in the article: if you have a problem that takes 11 days to run in QGIS,
then you shouldn't be using QGIS, but a tool that is designed for processing
large amounts of data. 1-2 million points might be a lot from a GIS tools
perspective, but it's absolutely nothing from a big data/HPC perspective, so
be sure to check what those guys are doing.

~~~
kaybe
So far I've been able to avoid that by only looking at chunks of the datasets.
Can you give me some keywords for what to look for/point me in the generally
right direction? (I dread the day my supervisor wants to look at all the data
at once.)

~~~
dagw
First rule: PostGIS is your friend. The PostGIS people have done a lot of work
on making the processing of large GIS datasets fast and easy, so always start
there. If for some reason that doesn't work...

As a general rule, always ask yourself: what am I actually trying to do,
mathematically? Then ask how someone would go about solving that maths
problem if you didn't tell them it was a GIS problem.

Much of raster analysis, for example, is just a combination of matrix math and
convolutions. So look up how the numerical analysis people do matrix math and
convolution on huge matrices. In Python, for example, you have tools like
numexpr for fast elementwise transformation of matrices, and via numpy/scipy
you can call BLAS and LAPACK. If you're dealing with rasters that don't fit in
memory, take a look at solutions like Dask.
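
A toy example of that mindset (the array is a stand-in for a raster band, and
this ignores the georeferencing entirely):

    import numpy as np
    import numexpr as ne
    import dask.array as da

    band = np.random.rand(8000, 8000).astype("float32")  # stand-in raster band

    # fast, multithreaded elementwise transform without intermediate copies
    scaled = ne.evaluate("log1p(band) * 100")

    # chunked reduction; the same pattern works for arrays that don't fit in RAM
    lazy = da.from_array(band, chunks=(2000, 2000))
    print(lazy.mean().compute())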

Same principle applies when dealing with vector data. Many problems are 'just'
graph theory, so find out which libraries the graph theory people use to solve
that sort of problem on large graphs and use those instead. Or if a problem
reduces to a line/polygon intersection problem, well, that's just raycasting,
and the games industry has spent a lot of effort on making that really, really
fast.
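
For example, the line/polygon case with a bulk-loaded spatial index via
shapely (a toy sketch; assumes shapely 2.x, where query() returns indices):

    from shapely.geometry import LineString, Polygon
    from shapely.strtree import STRtree

    polygons = [Polygon([(i, 0), (i + 1, 0), (i + 1, 1), (i, 1)])
                for i in range(10_000)]
    line = LineString([(0, 0.5), (500, 0.5)])

    tree = STRtree(polygons)               # bulk-loaded R-tree
    candidates = tree.query(line)          # indices of bounding-box hits
    hits = [i for i in candidates if polygons[i].intersects(line)]
    print(len(hits))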

Finally, learn to use the underlying libraries from the command line and
scripts rather than via the GUI, and learn how to divide that work across
several processors/machines. GDAL + GNU parallel from the command line will
transform 1,000 rasters faster than QGIS could ever hope to.
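
The same pattern from Python, if you'd rather stay there (paths and target
CRS are placeholders):

    from concurrent.futures import ProcessPoolExecutor
    from pathlib import Path
    from osgeo import gdal

    def reproject(src: Path) -> str:
        """Warp one raster to British National Grid (placeholder CRS)."""
        dst = src.with_name(src.stem + "_27700.tif")
        gdal.Warp(str(dst), str(src), dstSRS="EPSG:27700")
        return dst.name

    if __name__ == "__main__":
        rasters = sorted(Path("rasters").glob("*.tif"))
        with ProcessPoolExecutor() as pool:      # one worker per core
            for done in pool.map(reproject, rasters):
                print(done)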

------
jerluc
I'm surprised they didn't consider hosting their own vector tile server,
instead of painstakingly rasterizing their entire dataset and optimizing it
over and over to fit into the Mapbox data caps.

My company recently hit the need for maps showing several of our own
proprietary layers at various zoom levels, and considering how much of the
data is already in GeoJSON, converting that to the Mapbox Vector Tile binary
format (MVT) was a breeze to implement on our own servers, compared to
rasterizing all of the layers and re-rendering whenever any part of our
dataset changes.
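
The encoding step itself is tiny, e.g. with the mapbox_vector_tile package
(a sketch only; the real work of cutting the data per tile and transforming
coordinates into tile-local space is omitted, and the layer/property names
are made up):

    import mapbox_vector_tile

    tile_bytes = mapbox_vector_tile.encode([
        {
            "name": "broadband",                     # layer name, made up
            "features": [
                {
                    "geometry": "POINT(2048 2048)",  # already in tile-local coords
                    "properties": {"speed_mbps": 42},
                }
            ],
        }
    ])
    # serve tile_bytes with Content-Type: application/vnd.mapbox-vector-tile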

~~~
dylrich
I'm surprised about this as well. Even if they didn't want the overhead of
running their own tile server, one of the best parts of the MVT format in my
mind is that one-off tilesets like this layer can trivially be thrown into S3
and served extremely cheaply. That removes a ton of the effort they put into
circumventing Mapbox's limitations, and it's cheaper to host.

~~~
Brakenshire
And presumably be cached through a service worker, which might be very useful.

------
dylrich
It's interesting seeing the process they went through to work around Mapbox's
hosting limitations, but the map doesn't feel terribly usable to me. The chart
doesn't update on pan/zoom (at least for me in Firefox), and the raster area
is a little funky; I wish it used a standard interpolation that covered
everywhere instead of selective areas.

Also, I am loading a ton of data on this page. Interactive maps are nice and I
love them, but they're totally not usable on a mobile network for a ton of
people (ironically, given the content of the map). I'm very curious why they
are using React at all; it seems like extra page bloat for just one
interactive map + chart. Do they test usability on 3G networks? I did not have
great load times when off wifi. I think this would be better on a stand-alone
page so you can read the article without having your load time impacted by
the extra JavaScript.

~~~
Brakenshire
To be fair, it's a journalistic exercise, not a long-running web app project.
If the page loads a second later but development time is significantly
reduced, it might be worth the tradeoff. The map works well for me, using
mobile on wifi.

------
stirbot
Actual map is here: https://ig.ft.com/gb-broadband-speed-map/

------
brootstrap
anyone else have experience with GDAL? god i hate that software so much but it
does some cool stuff. Been working with GDAL/QGIS on the regs for about 5
years now and it's pretty amazing what you can do. everything looks great in a
map

~~~
llao
Yes, ask me anything and I will try to help you.

Can you try to categorise what you hate about it? Assuming it's specific
things, and not just the general hurdle of figuring out what goes where and
what means what.

