
On Counting (2017) - cdoxsey
http://www.doxsey.net/blog/on-counting
======
msteffen
Sharding a simple bash script across multiple nodes is one of the original
(and still one of the most common) uses of Pachyderm.

We actually have a tutorial based on it:
[https://github.com/pachyderm/pachyderm/blob/master/doc/examp...](https://github.com/pachyderm/pachyderm/blob/master/doc/examples/fruit_stand/README.md)

(disclosure: I work at Pachyderm.
[https://pachyderm.io/](https://pachyderm.io/))

~~~
hinkley
Or the bland and boring version: bucket sort and xargs.

------
bo1024
It would have been cool to see a discussion about algorithmic solutions,
rather than solutions based on tools like MySQL.

There are three basic approaches: sort and remove duplicates (the original
bash script); insert all items into a set (e.g. hash table) that only keeps
unique copies, and count its size; or probabilistic solutions like Count-Min-
Sketch or HyperLogLog. But the problem with the latter is that they are
approximate, which doesn't sound ideal when billing customers.

The problem with both of the first two approaches is that they require all
items to be stored in memory at the same time. As long as that's true, either
the sort or hashtable approach will work fine. But once you run out of RAM on
a single machine, it's going to slow way way down as it swaps to and from disk
constantly.

To me, the natural solution is to just split the dataset alphabetically into,
say, 10 or 100 equal-size jobs, and run these either sequentially or in
parallel on 10 or 100 machines. So for example if the unique IDs are random
strings of digits, then everything starting with 00 is in the first job,
everything starting with 01 is in the second, up to 99. For each job, apply
either the sort or the set approach; shouldn't matter much.

(edit) For example, here's sequential pseudocode; the second step is
embarrassingly parallel.

    
    
        # split the records by two-character prefix
        for each record in records_list:
            prefix := record[0:2]
            write record to file "records" + prefix
    
        # count unique records, one prefix file at a time
        total := 0
        for each prefix file:
            initialize hash_table
            for each record in the prefix file:
                insert record into hash_table, ignoring if already present
            total += size of hash_table
    

(second edit) I'm a theorist, not a practitioner, so I'm ignoring many
practical issues about where to store and back up the data, etc.
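
For concreteness, a runnable version of that sketch might look something like this in Python (untested; the bucket directory and the two-character prefix are just illustrative):

    import os
    
    def count_unique(records, work_dir="prefix_buckets"):
        """Split records by their first two characters, then count uniques bucket by bucket."""
        os.makedirs(work_dir, exist_ok=True)
    
        # pass 1: write each record into the file for its two-character prefix
        handles = {}
        try:
            for record in records:
                prefix = record[0:2]
                if prefix not in handles:
                    handles[prefix] = open(os.path.join(work_dir, "records" + prefix), "w")
                handles[prefix].write(record + "\n")
        finally:
            for f in handles.values():
                f.close()
    
        # pass 2: each bucket is small enough to dedup in memory on its own
        total = 0
        for name in os.listdir(work_dir):
            with open(os.path.join(work_dir, name)) as f:
                total += len({line.rstrip("\n") for line in f})
        return total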

~~~
enedil
So you're proposing a distributed radix tree?

~~~
bo1024
You can think of it that way, but much simpler. Only one layer really.

------
saagarjha
> The oddly named wc is a command used to count things.

It’s named after _w_ord _c_ount.

~~~
AstralStorm
Still oddly named. Why isn't it called count or cnt?

~~~
Ensorceled
wc shares an environment with grep, awk, sed and vi; wc is NOT oddly named.

~~~
Sir_Cmpwn
For those interested in the etymology of these:

grep is from the ed command (which itself is short for editor, the original
Unix text editor) g/re/p, which Globally searches for a Regular Expression and
Prints it.

sed is Stream EDitor.

vi is VIsual editor, which is so named because it's like ed but shows you what
you're editing.

Not sure where the name awk comes from.

All of these tools are closely related to the venerable ed(1) command.

~~~
Wald76
Awk is named after its authors at Bell Labs: Aho, Williams, and Kernighan.

~~~
magoghm
Aho, Weinberger, and Kernighan.

~~~
Wald76
You are so right, thank you!

------
monochromatic
Is there a fast way to detect duplicates when you first generate the records?
If so, could just keep a continuously updated counter for each client,
incrementing it every time you add a record, and decrementing on duplicates to
avoid double counting.
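
A rough sketch of that bookkeeping, assuming duplicates can be detected by record ID at write time (which is the big "if"):

    from collections import defaultdict
    
    seen = defaultdict(set)      # client -> record IDs already written this billing period
    billable = defaultdict(int)  # client -> running count of unique records
    
    def record_activity(client, record_id):
        billable[client] += 1        # increment for every record as it arrives...
        if record_id in seen[client]:
            billable[client] -= 1    # ...and decrement if it turns out to be a duplicate
        else:
            seen[client].add(record_id)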

~~~
Jupe
Agreed.

The analysis and proposed solutions all seem to ignore the one important
aspect of this data: time.

Why would you wait until the end of the month to count the recorded activities
on the first of the month? And, what happens to failed activities that are
repeated on the first of the next month? Are customers double-billed? If I'm
interpreting the scenario correctly, a large network outage on April 30th
would see a large number of duplicate charges as they are re-applied in May.

Counting during the send process makes much more sense. Duplicates can be
handled in-line, and even detected over month boundaries. If it's not "worth it" in terms of engineering time, then I'd be concerned about the leadership; the author indicates 'this was the only way we made money'.

~~~
cdoxsey
Billing was by month. This was a business decision.

It makes sense too. Billing requires a lot of man-hours to pull off: invoices, auditing, Excel spreadsheets, etc. It's a people problem, not an engineering problem.

I'm not sure of the disconnect here, but de-duplication is not trivial. Doing it every day isn't any easier than doing it every month.

Doing it for all time is completely infeasible.

There was not a database of every piece of social media data sent out the
door. That's what you would need to make sure not to record the entry again.
All we had were flat files in s3.

Big flat files. It took hours to download and merge them all.

Once we had a database (Cassandra) it was updated continually (by a Kafka consumer) and we could query it in a few minutes.

~~~
Jupe
Fair enough... but somehow your system knew of failed attempts and the need to retry, correct? I'm sure I'm trivializing a rather complicated workflow, but if there's a way to detect failed attempts, there should be a way to remove them from the billing. (Perhaps a separate S3 file that can be used to reduce the number of charged attempts, or split the ones you have into 'success' and 'fail' versions?)

Oh, and Cassandra is great... used it for years. Just beware of oversized results. I've seen runaway queries crash the instance(s) running the query; ugly stuff.

Glad you found a solution that works!

------
jdironman
This person writes quite well. Few articles draw me in and keep me there until the end. His use of language and storytelling really flows, and it also made me reminisce about when I used to be passionate about programming myself. Great writing.

~~~
nyc111
I thought the same thing. Probably because he writes the way he speaks. Whatever it is, I'd like to know the secret.

~~~
jdironman
Right. I never felt like it dragged on and on like some do. It threw the technical details in while still keeping you reading. I'm curious now about his book on learning Go. If it reads the same way, it would probably be worth a read.

------
Radim
If you're into fast cardinality estimation (HyperLogLog) and item counting
(Count-min sketch, Bloom filters, hash tables), check out Bounter:

[https://github.com/RaRe-Technologies/bounter/](https://github.com/RaRe-Technologies/bounter/)

(Pythonic interface on top of highly optimized algorithms, faster than dict
but using limited memory, MIT license)
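
A minimal usage sketch (the size_mb value is just illustrative; see the README for the exact API):

    from bounter import bounter
    
    counts = bounter(size_mb=256)                  # bounded memory regardless of input size
    counts.update(["tweet1", "tweet2", "tweet1"])  # feed it your record IDs
    print(counts["tweet1"])                        # approximate per-item count
    print(counts.cardinality())                    # approximate number of distinct items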

------
mfontani
Reminds me quite a bit of [https://adamdrake.com/command-line-tools-can-be-235x-faster-...](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html)

… in which, though, the author at least parallelises the counting.

~~~
ILMostro7
Bash 5.0? Parallel execution

~~~
lmilcin
Counting unique occurrences isn't easy if you are given data the way it was described in the article and you have a requirement for an exact count.

But there are things you can do to help if you are willing to redesign it a bit.

You could partition the data up front, as it is being written. Then you could send the partitions to be processed in parallel (and you could still do this with the shell command!)
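
As a sketch of that shape of solution, assuming partitions are written so that IDs sharing a prefix always land in the same file (paths and pool size are illustrative):

    import glob
    from multiprocessing import Pool
    
    def count_partition(path):
        # a partition only holds records whose IDs share a prefix,
        # so it can be deduplicated in isolation
        with open(path) as f:
            return len({line.rstrip("\n") for line in f})
    
    if __name__ == "__main__":
        partitions = glob.glob("partitions/records-*")
        with Pool() as pool:
            total = sum(pool.map(count_partition, partitions))
        print(total)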

------
djhworld
AWS Athena is pretty good at tackling this problem, or PrestoDB running on
EMR.

As long as your S3 data is reasonably partitioned, and you don't have millions of small files, it does a reasonable job, even on count(distinct).

It even supports approx_distinct for HyperLogLog estimation too.

------
tomtimtall
Rather than first building up a huge pile of logs and then counting them, move the count to the location that generates the logs or the location that stores them. Easily done, no need for algorithms or special-case solutions, just a simple set and count. Added bonus: the “task” is axiomatically done when the month ends, with no difficulties or special considerations needed there.
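
A toy version of that idea, assuming record IDs are available where the logs are written (in-memory here; in practice the set would live in whatever store holds the logs):

    from collections import defaultdict
    from datetime import date
    
    # (client, "YYYY-MM") -> set of record IDs delivered in that month
    delivered = defaultdict(set)
    
    def log_delivery(client, record_id, when=None):
        month = (when or date.today()).strftime("%Y-%m")
        delivered[(client, month)].add(record_id)  # the set ignores duplicates for free
    
    def monthly_count(client, month):
        # nothing left to compute when the month ends; the count is already sitting there
        return len(delivered[(client, month)])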

------
thanatropism
It's worth mentioning that counting machines are way older than computers and the Census used to be done with "dumb" punched card munchers.

[https://en.wikipedia.org/wiki/Tabulating_machine](https://en.wikipedia.org/wiki/Tabulating_machine)

------
ikeboy
Why not count daily and incrementally? No reason to wait for the last day and then have to meet a deadline; do most of the processing in advance and you only need to do a small amount of work each day to update.

~~~
cdoxsey
What if you receive the same tweet 3 days later? It should only be counted
once.

~~~
ikeboy
You dedup against the old data. Easier than deduping a month's worth all at
once, since the old data is sorted.

~~~
cdoxsey
Sorry what I'm getting at is you can't do this problem incrementally. You
can't calculate the count on day 1 and add it to the count on day 2. The count
of each day is intertwined with all the others.

But yes, I suppose if at the end of each day you deduped that day's records against the rest of the month, you could then add them all together.

Unfortunately day 30 would still be just as bad as doing the whole month,
since removing duplicates is as expensive as counting uniques.

~~~
ikeboy
> removing duplicates is as expensive as counting uniques.

I don't think so. You do all kinds of things (unzip, sort, uniq, plus
apparently 1-2 passes through the whole data). If you did it daily, you only
need to dedup day 30 vs a sorted version of the rest of the month, which takes
around a 30th of the time, and you don't need to unzip/sort/uniq 29/30ths
worth of data. I don't know your exact structure but I don't see why most of
the computation can't be done earlier.
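
As a sketch of that merge step, assuming you keep one sorted, deduplicated file of the month's IDs so far and a single day's IDs are small enough to sort on their own:

    import heapq
    
    def merge_day(month_path, day_ids, out_path):
        """Fold one day's IDs into the month's sorted unique-ID file.
        Returns the number of unique IDs seen so far this month."""
        day_sorted = sorted(set(day_ids))
        total = 0
        with open(month_path) as month, open(out_path, "w") as out:
            month_ids = (line.rstrip("\n") for line in month)
            previous = None
            # both inputs are sorted, so one linear pass removes all duplicates
            for record_id in heapq.merge(month_ids, day_sorted):
                if record_id != previous:
                    out.write(record_id + "\n")
                    total += 1
                previous = record_id
        return total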

Think of it this way: on day 30 with the script described in the post, for the
first $X amount of time you're reading the beginning of the files and don't
touch any day 30 data until later in those 16 hours. So there's definitely
some kind of processing that could be done prior to day 30.

------
halayli
AWS S3 Select might be a good fit.

[https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-s...](https://docs.aws.amazon.com/AmazonS3/latest/dev/s3-glacier-select-sql-reference-select.html)

