
How ProPublica Illinois Uses GNU Make to Load Data - danso
https://www.propublica.org/nerds/gnu-make-illinois-campaign-finance-data-david-eads-propublica-illinois#146596
======
rainbowmverse
A lot of comments in here are poking fun at how little data it is relative to
a commercial data mining operation. The data they process and what they do
with it is worth more to society than any number of petabytes crunched to
target ads. Processing petty petabytes is not praiseworthy.

If you focus on the headline, you'll miss the point. The point is they used
open source technology to process public data for reporting once the
government stopped updating its own tools.

~~~
qubax
> The data they process and what they do with it is worth more to society than
> any number of petabytes crunched to target ads.

No it isn't. You are getting defensive for no reason. If ProPublica ceased to
exist, or had never existed, it wouldn't matter one bit to the world. You
could even argue the world would be better off.

> Processing petty petabytes is not praiseworthy.

From a technical standpoint, and in many other ways, it is.

I don't get why you are getting offended by people making a jab at the scant
amount of data. Last I checked, Hacker News is a technology-oriented site. And
from a technology point of view, what ProPublica is doing is a joke. It's a
toy amount of data.

Why not just say ProPublica is not a technology company, and hence people
shouldn't expect technological feats of wonder?

> The point is they used open source technology to process public data for
> reporting once the government stopped updating its own tools.

Which is something I could have done on a lazy afternoon all by myself. It
isn't anything to be impressed about. But good for them anyways.

~~~
danso
ProPublica is not a technology company, it's a non-profit investigative
journalism outlet.

They and countless other journalism/civic orgs would likely be happy for you
to show them up by whipping up usable ETL scripts relevant in their respective
domains. Since it all involves public open data you don't have to wait for
anyone's permission.

------
peterwwillis
Since they have "A Note about Security", how about locking down that Python
environment?

\- Add hashes to their already-pinned requirements.txt deps:
[https://pip.pypa.io/en/stable/reference/pip_install/#hash-checking-mode](https://pip.pypa.io/en/stable/reference/pip_install/#hash-checking-mode)

\- Add a Makefile entry to run `[ -d your-environment ] || ( virtualenv
your-environment && . your-environment/bin/activate &&
./your-environment/bin/pip install --no-deps --require-hashes -r
requirements.txt )`
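
Or, expressed as a standalone Make rule (untested sketch; the directory
itself is the target, so the recipe only runs when it's missing, and recipe
lines need a literal tab):

    your-environment: requirements.txt
        virtualenv your-environment
        ./your-environment/bin/pip install --no-deps --require-hashes -r requirements.txt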

------
bazizbaziz
Minor nitpick about their exit code technique [0]: The command checks if the
table exists, but it does not appear to re-run if the source file has been
updated. Usually with Make you expect it to re-run the database load if the
source file has changed.

It's better to use empty targets [1] to track when the file was last
loaded and re-run if the dependency has changed.

[0]
[https://github.com/propublica/ilcampaigncash/blob/master/Makefile#L27](https://github.com/propublica/ilcampaigncash/blob/master/Makefile#L27)

[1]
[https://www.gnu.org/software/make/manual/html_node/Empty-Targets.html](https://www.gnu.org/software/make/manual/html_node/Empty-Targets.html)
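
A minimal sketch of what I mean (`loaded` is the stamp file recording the
last load; `load_table.sh` and `source_data.csv` are hypothetical stand-ins;
recipe lines need a tab):

    loaded: source_data.csv
        ./load_table.sh source_data.csv
        touch loaded

Now `make loaded` is a no-op unless source_data.csv is newer than the stamp.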

------
danso
> _The first is that we use Aria2 to handle FTP duties. Earlier versions of
> the script used other FTP clients that were either slow as molasses or
> painful to use. After some trial and error, I found Aria2 did the job better
> than lftp (which is fast but fussy) or good old ftp (which is both slow and
> fussy). I also found some incantations that took download times from roughly
> an hour to less than 20 minutes._

Tangential question: is it possible to use wget for FTP duties? Though there
may be additional FTP-specific functionality in `aria2c`, of course:

[https://serverfault.com/questions/25199/using-wget-to-recursively-download-whole-ftp-directories](https://serverfault.com/questions/25199/using-wget-to-recursively-download-whole-ftp-directories)
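
Going by the answers there, something like this (untested; host and
credentials are placeholders):

    wget -r -nH --user=USER --password=PASS ftp://ftp.example.com/path/to/dir/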

~~~
rasz
aria2 is multi-connection (`aria2c -x5` means five concurrent connections);
that's the main reason for the speed bump.

~~~
flukus
Does this increase speed when it's only downloading a single file at a time?
It might be better off using make's own parallelism (`make -j5`) to be able to
process data while still loading other data.

~~~
therein
Each connection requests a different range within the same file and they
download together.
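
E.g. (host and file are placeholders):

    # one file, split into five ranged connections to the same server
    aria2c -x5 -s5 ftp://ftp.example.com/bigfile.zip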

------
rockmeamedee
Make is often brought out for data / "single machine ETL" jobs, but for big,
complicated (and iterative) workflows it doesn't feel good enough to me.

What do you folks use? Drake, "make for data"
([https://github.com/Factual/drake](https://github.com/Factual/drake)), seems
OK, but doesn't have "batch" jobs (aka "pattern rules"), where you can process
every file in a directory matching a pattern.
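
For comparison, the Make pattern rule I mean looks roughly like this (names
hypothetical; one stamp file per input, recipe lines need a tab):

    SOURCES := $(wildcard data/*.csv)
    STAMPS  := $(SOURCES:data/%.csv=stamps/%.loaded)
    
    all: $(STAMPS)
    
    stamps/%.loaded: data/%.csv
        mkdir -p stamps
        ./load_into_db.sh $<
        touch $@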

Others have come up with different swiss army knives but nothing ever sticks
for me, it usually ends up as a single Makefile with eg 3 targets that call a
bunch of shell scripts.

The whole thing would be configurable to build from scratch, but not well set
up to do incremental ETL on a per-file basis, after I e.g. delete some
extraneous rows in one file, clean up a column, redownload a folder, or add
files to a dataset.

~~~
rspeer
I use Snakemake [1], a parallel make system for data, designed around pattern-
matching rules. The rules are either shell commands or Python 3 code.

I settled on it after originally using make and getting frustrated with the
crazy workarounds I needed because it doesn't understand build steps with
multiple outputs, then switching to Ninja, where you have to construct the
dependency tree yourself, and finally ending up on Snakemake, which does
everything I need.

[1]
[https://snakemake.readthedocs.io/en/stable/](https://snakemake.readthedocs.io/en/stable/)

~~~
reacharavindh
Thank you for sharing this information about snakemake. I administer a cluster
for a group of geneticists. I'll try to get them to use it for their
publications to make their results easily reproducible by others.

------
pdkl95
> [...] --ftp-passwd="$(ILCAMPAIGNCASH_FTP_PASSWD)"
> ftp://ftp.elections.il.gov/[...]

Is that using traditional (plaintext) FTP? Is it listening on port 21?

    
    
        ~ $ ftp ftp.elections.il.gov
        Connected to ftp.elections.il.gov (163.191.231.32).
        220-Microsoft FTP Service
        220 SBE
        Name (ftp.elections.il.gov): ^C
    

It looks like they are sending their password in plaintext. aria2 supports
SFTP, so they should really talk to elections.il.gov about moving to SFTP or
any other protocol that doesn't send the password in plaintext.

~~~
danso
I imagine there would be other systems (state-owned and private) that use the
FTP server, and maybe in a way that changing protocols is inexplicably full of
friction. I wonder why the elections server, assuming it only contains records
legal to distribute to the public, is even password protected. Maybe it was a
policy when govt bandwidth was scarce. California, for example, has campaign
finance data on a public webserver:
[https://www.californiacivicdata.org/](https://www.californiacivicdata.org/)

And the FEC has an API, but has long had the data hosted on public FTP:
[https://classic.fec.gov/finance/disclosure/ftp_download.shtm...](https://classic.fec.gov/finance/disclosure/ftp_download.shtml)

------
chemicalcrux
I did something like this a few years ago! I needed to do a bunch of
transformations and measurements of data that came in on a regular basis. Make
was a perfect fit - I could test the whole process with a single command,
cleaning either just the result data, or nuking everything to make sure it
pulled stuff in properly.

I spent some time trying to write my own processing system in Python before
realizing this was a familiar task...

------
lasermike026
I really like make! I use it almost every day. I like it for the structure and
simplicity. I don't use it for everything. I plan to use it for the
foreseeable future.

Why I like make over shell scripting (sometimes) is that it enforces
structure. Shell scripts can turn into a real hairball.

When I did ruby I really enjoyed using rake.

------
stakhanov
On the whole debate revolving around gigabytes in the title, I'd like to add:

There's a well-substantiated linguistic theory revolving around "maxims of
conversation". Maxims of conversation are so strongly universal among the
speakers of a given language that they become part of the implied meaning of a
conversational act.

For example, the maxim of cooperativity implies that when a person sitting in
a cold room next to a window is told "It's a bit chilly, isn't it?" by a
person sitting further from the window, they can take it to mean "Please
close the window".

[https://en.wikipedia.org/wiki/Implicature#Conversational_imp...](https://en.wikipedia.org/wiki/Implicature#Conversational_implicature)

Similarly, there are certain maxims of conversation which are part of the
language game inherent in the formulation of the title of a blogpost. They are
kind of assumed to be boasting about something. So when somebody says "We
figured out a way to load a gigabyte's worth of data into a database in a
single day" then the being-boastful-about-something maxim is violated. That's
why it triggered so many people.

And pointing out that this is not something to be boastful about is a
perfectly valid thing to do to keep certain facts straight.

...just saying.

But, by all means, if you get a thrill out of it, keep downvoting me.

------
dev_dull
Can someone help me understand the advantages of Make over a bash script?
Isn't bash superior in almost every way?

~~~
mmt
In addition to all the features the sibling comments noted, it's important to
note that there's no contradiction:

bash is usually the scripting language one uses inside of a Makefile.

It's the default, although one could use any scripting language. Point being,
there's no "Make" language, beyond the syntax for describing those dependency
relationships and variable assignments.

~~~
pletnes
I believe /bin/sh is the default, not bash. But this can be changed.

~~~
mmt
You're right that it's /bin/sh, but, since it could be (and is, in some cases)
bash, it's not quite right to call it "not bash", either.

I'll grant that the distinction is important, though, in the face of the
history of #!/bin/sh Linux scripts with bashisms breaking upon the
Debian/Ubuntu switch to dash. Even if you're on a system where /bin/sh is
bash, it's safest to set SHELL in your GNU makefiles to bash explicitly, if
that's what you're writing in.
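
E.g., at the top of the makefile:

    # recipes below may safely use bashisms like [[ ]] or pipefail
    SHELL := /bin/bash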

------
beebmam
This is exactly the kind of purpose I love seeing open source tools used for.
Kudos to propublica for leveraging open source to improve their ability to
function!

~~~
catacombs
While it's nice to see Propublica use open source software, keep in mind
dozens and dozens of other news organizations use the same tools.

------
rpz
Shoulda used kdb!

------
brudgers
Original,
[https://www.propublica.org/nerds/gnu-make-illinois-campaign-finance-data-david-eads-propublica-illinois#146596](https://www.propublica.org/nerds/gnu-make-illinois-campaign-finance-data-david-eads-propublica-illinois#146596)

~~~
yjftsjthsd-h
I appreciate that it's under a /nerds path:)

~~~
catacombs
It's cute.

------
stakhanov
Was that supposed to say Petabytes? Gigabytes is really not that impressive.

~~~
drb91
It doesn’t seem like the size is supposed to be impressive, although I do not
know why it is in the title. This is about the use of make.

~~~
stakhanov
...well, that was what I was trying to point out.

------
usgroup
Tbh I would have opted for Jenkins with declarative pipelines in his
situation. Then you get logging, events, cron, CI, all more or less out of
the box.

IMO, this is a bad use case for make.

