
Investigating the 900GB “Collection #1-5” password leaks with R - grssnbchr
https://timogrossenbacher.ch/2019/03/big-data-journalism-with-spark-and-r/
======
aw3c2
> We’re talking about 31 million lines of data here (all lines with Swiss
> email addresses in the leak). On our workstation, reducing this to the
> distinct email addresses (approx. 3,3 million) took a mere 10.099 seconds (4
> + 6 + .099).

That's just 700 MB of data... On my five-year-old mid-range PC this took 12
seconds with sort and uniq.

    
    
        $ wc -l email30m
        31000000 email30m
        
        $ time sort email30m | uniq > email30m.uniq
        
        real 0m11.498s
        user 0m33.833s
        sys 0m0.933s
        
        $ wc -l email30m.uniq 
        2172320 email30m.uniq
    

GNU sort automatically spread the work across all available CPU cores (notice the user time is roughly three times the wall-clock time).
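
You can also spell the parallelism out explicitly; roughly like this (a sketch: `sort -u` does the same job as `sort | uniq` for exact duplicate lines, and `$(nproc)` just reports the core count):

    
    
        $ sort --parallel=$(nproc) -u email30m > email30m.uniq
    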

~~~
munro
I'm getting 2 minutes 29.98 seconds on my 2017 MacBook Pro with 8 cores
+ 16 GiB RAM. I'm using GNU tools; could you share more details on how you're
getting 12 seconds?

    
    
        $ time awk 'BEGIN{for(i=0; i<3100000; i+=1) for(y=0; y<10; y+= 1) print i "@gmail.com"}' > email30m.sorted
        awk  > email30m.sorted  8.29s user 0.82s system 97% cpu 9.352 total
    
        $ ls -lh email30m.sorted
        522M Mar  8 17:14 email30m.sorted
    

I didn't have your email30m dataset, so I fabricated one; mine is smaller at
522M and has less entropy, since I'm assuming everyone uses Gmail and numeric
usernames. :)

    
    
        $ time shuf email30m.sorted > email30m
        shuf email30m.sorted > email30m  15.58s user 1.69s system 98% cpu 17.561 total
    

I also shuffled the dataset, because sorting an already-sorted file would be cheating.

    
    
        $ wc -l email30m
        31000000 email30m
    
        $ time sort email30m | uniq > email30m.uniq
        sort email30m  873.55s user 4.29s system 585% cpu 2:29.98 total
        uniq > email30m.uniq  4.10s user 0.34s system 2% cpu 2:29.98 total
    
        $ wc -l email30m.uniq
        3100000 email30m.uniq
    

I followed what you did, but it took way longer on my machine.

    
    
        $ time sort email30m.sorted | uniq > email30m.sorted.uniq
        sort email30m.sorted  413.75s user 3.66s system 468% cpu 1:29.04 total
        uniq > email30m.sorted.uniq  4.11s user 0.40s system 5% cpu 1:29.04 total
    

And just out of curiosity, it takes 1 minute 29 seconds to sort & uniq the
presorted dataset.

~~~
mappu
I ran all your commands above on Debian Buster (4-core i5-2500 from 2011, 8 GB
RAM, SATA SSD). My /tmp apparently isn't a tmpfs mount.

Results: [http://paste.debian.net/1072430/](http://paste.debian.net/1072430/)
It took 54s to sort|uniq and 38s for the presorted dataset.

EDIT: For the unsorted dataset: 35s for `sort --parallel=8 -u` and 37s for
`sort --parallel=4 -u`.

------
girst
You can actually get a lot just from grep (well, zgrep, which searches
compressed files)

    
    
        zgrep -a '@[^.]\+\.ch' /path/to/collections/**/*.gz
    

will spit out (almost) all Swiss email addresses. Sprinkle some grep -v, wc
and sort|uniq here and there and Bob's your uncle :-)
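
For instance, counting the distinct addresses could look something like this (a rough, untested sketch; the second grep just pulls the address itself out of each matching line):

    
    
        $ zgrep -a '@[^.]\+\.ch' /path/to/collections/**/*.gz \
            | grep -o '[^ :]\+@[^.:]\+\.ch' \
            | sort -u \
            | wc -l
    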

(I'm in no way trying to dismiss the effort; my method won't get you as much
detail, but it's enough to find yourself and any friends who asked to be
searched for in the dumps.)

~~~
jmcphers
Previously on HN, a similar approach demonstrating that command-line tools can
be hundreds of times faster than the fancy distributed approach (for certain
classes of problems):

"Command-line tools can be faster than your Hadoop cluster"

[https://news.ycombinator.com/item?id=8908462](https://news.ycombinator.com/item?id=8908462)

~~~
girst
It's not just about "raw" speed, but also about how much mental time you spend
thinking about your setup. Even if you didn't know about zgrep already, that's
at most minutes of thinking ahead vs. potentially hours or days of setting up
cloud providers, big toolchains, and whatever else just to get started.

Since we're throwing links around, the (tangentially related) Knuth vs.
McIlroy saga comes to mind:
[https://news.ycombinator.com/item?id=15264997](https://news.ycombinator.com/item?id=15264997)

~~~
obvious_throwaw
> even if you didn't know about zgrep already

Interestingly, the dumps come with a brief README file telling you... to use
zgrep, complete with example arguments :)

------
usgroup
This is cool and all, but one could smash this out with GNU parallel and awk
with no problem.

Almost everything mentioned can be done on a stream, including the aggregates,
with none of the Spark fanfare.
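
Something along these lines, for instance (an untested sketch, assuming gzipped email:password dumps, one pair per line; file names are placeholders):

    
    
        # fan the .gz files out over the cores, pull Swiss addresses, then dedupe
        $ find /path/to/collections -name '*.gz' \
            | parallel "zcat {} | awk -F: 'tolower(\$1) ~ /@[^.]+\.ch\$/ {print tolower(\$1)}'" \
            | sort -u > swiss_emails.txt
    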

You could also happily do this with PipelineDB in a single SQL query.

What is the magic allure of Spark?

~~~
saberience
The magic allure of Spark is when you have a team of 20-30 analysts and 10
data scientists working on a 4 PB dataset stored on an 80-node cluster. Let me
know when you have a solution for that without using big data tools like
Hadoop, Spark, etc.

~~~
usgroup
I mean for the problem at hand ... not for anything, ever :)

I think the argument for Spark and the like usually proceeds by just making
the dataset bigger until whatever you're comparing it to becomes absurd (e.g.
command line vs Spark). It was the same deal in the MapReduce days. Works
like shit for 99% of datasets, but it's the best thing ever when there are no
other alternatives :)

------
kkarakk
Am I wrong in thinking that 900 GB barely counts as "big data"?

~~~
mechanical_jane
I like the definition that Big Data starts at the capacity of the biggest
single hard drive you can buy. So this year it is 14 TB.

------
spectramax
I’d love to see a benchmark of sorts on a real-life big data set such as this
one, comparing Python, R, Julia, Matlab, etc.

~~~
kkarakk
A real-life big data set would be a couple hundred terabytes of
multidimensional values, not 900 GB of text.

