
Data Sketching: The approximate approach is often faster and more efficient - yarapavan
http://queue.acm.org/detail.cfm?id=3104030
======
a_bonobo
This is gaining popularity in bioinformatics too.

When you sequence a genome, the machine reads it in tiny pieces, and you get
the same piece several times. Coverage is uneven: some pieces show up a lot,
some show up rarely, so you have to sequence the genome several times over, so
to speak (30 to 80 times is the norm).

A lot of that data is redundant just to capture those rare pieces, so 'data
sketching' is getting more popular.

Examples:

\- The ABySS genome assembler (experimentally) uses a Bloom filter to decrease
memory requirements and run time, at the expense of accuracy and having to
fiddle with the Bloom filter parameters. The paper assembles a genome using
35 GB of RAM instead of a few hundred GB.

\- The khmer package contains many functions to downsample sequencing runs
using, again, Bloom filters. It also has a HyperLogLog implementation to
quickly count k-mers (similar to n-grams, but over the ATGC alphabet instead
of words).

\- MinHash sketching is a popular first step in aligning messy PacBio reads
(implemented, for example, in the Canu assembler). People also use it to map
population structure from sequencing reads directly, and to quickly assign a
taxon to data from an unknown species (a rough sketch of the idea follows
below).

\- There are some approaches with the Oxford Nanopore sequencer (that tiny
thing you can hook up to your laptop) where you stream the incoming data
directly into the assembler or another program and tell the sequencer to stop
sequencing once you've found what you're looking for, or once your coverage is
high enough.

And so on. It's a goldmine for fast algorithms; I haven't seen anyone use a
Cuckoo filter yet, for example.
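
To make the MinHash example concrete, here's a minimal bottom-k MinHash over
DNA k-mers in Python. This is just an illustration of the technique; real
tools like Mash and sourmash use faster hashes, canonical k-mers, and more
careful estimators.

    import hashlib

    def kmers(seq, k=21):
        """Yield all overlapping k-mers of a DNA string."""
        return (seq[i:i + k] for i in range(len(seq) - k + 1))

    def kmer_hash(kmer):
        """Hash a k-mer to a 64-bit integer."""
        return int.from_bytes(hashlib.sha1(kmer.encode()).digest()[:8], "big")

    def minhash_sketch(seq, k=21, size=1000):
        """Keep the 'size' smallest k-mer hashes (a bottom-k MinHash sketch)."""
        return set(sorted({kmer_hash(km) for km in kmers(seq, k)})[:size])

    def jaccard_estimate(sketch_a, sketch_b, size=1000):
        """Estimate k-mer Jaccard similarity from the two sketches alone."""
        union = sorted(sketch_a | sketch_b)[:size]
        if not union:
            return 0.0
        shared = sum(1 for h in union if h in sketch_a and h in sketch_b)
        return shared / len(union)

Two genomes' sketches are a few kilobytes each, yet the estimated Jaccard
similarity is close to what you'd get from comparing the full k-mer sets.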

------
yarapavan
In response to this challenge, the model of streaming data processing has
grown in popularity. The aim is no longer to capture, store, and index every
minute event, but rather to process each observation quickly in order to
create a summary of the current state. Following its processing, an event is
dropped and is no longer accessible. The summary that is retained is often
referred to as a sketch of the data.
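
As a toy illustration of that model (my own example, not from the article):
a constant-size summary that is updated once per event, after which the event
is dropped.

    class RunningSummary:
        """Constant-memory summary of a stream; events are dropped after use."""

        def __init__(self):
            self.count = 0
            self.total = 0.0
            self.minimum = float("inf")
            self.maximum = float("-inf")

        def update(self, x):
            self.count += 1
            self.total += x
            self.minimum = min(self.minimum, x)
            self.maximum = max(self.maximum, x)
            # x is no longer needed here: only the summary survives

        def mean(self):
            return self.total / self.count if self.count else 0.0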

This article introduces the ideas behind sketching, with a focus on
algorithmic innovations. It describes some algorithmic developments in the
abstract, followed by the steps needed to put them into practice, with
examples. The article also looks at four novel algorithmic ideas and discusses
some emerging areas.

------
fabiofumarola
It is interesting to see that this topic, which has been around in academia
since 2004, is only now gaining popularity.

An interesting thing is that you can find some of these concepts implemented
in Apache Spark (approximate counting) and in BlinkDB. From a theoretical
point of view, many results are related to statistical bounds like the
Chernoff bound and the Chebyshev inequality. During my PhD I found this topic
interesting and promising for mining data with verified error bounds on
precision/recall and elapsed time.
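
For example, Spark ships an approximate distinct count (backed by
HyperLogLog++). A minimal PySpark sketch, where rsd is the target relative
standard deviation of the estimate:

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import approx_count_distinct

    spark = SparkSession.builder.appName("sketch-demo").getOrCreate()

    # 100k rows, but only 1,000 distinct user ids
    df = spark.createDataFrame([(i % 1000,) for i in range(100000)], ["user_id"])

    # Approximate distinct count with ~1% target relative error
    df.agg(approx_count_distinct("user_id", rsd=0.01).alias("approx_users")).show()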

To go deeply into this topic I suggest the book Knowledge Discovery from Data
Streams
[http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf](http://www.liaad.up.pt/area/jgama/DataStreamsCRC.pdf)
and other books written by Joao Gama.

~~~
lorenzhs
Agreed, streaming algorithms and sketches are a fascinating topic. The
article's author, Graham Cormode, has been conducting research in that area
for a long time and is one of the co-authors of the _Count-Min Sketch_ paper.

Basically, a _Count-Min Sketch_ is the same data structure as an undersized
counting Bloom filter, but it's used quite differently, and that allows for
loads of interesting operations. The basic one is estimating the number of
occurrences of an element (how often has it been added?), but it supports more
complex queries as well, such as quantiles and heavy hitters.
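
A minimal Python sketch of the point-query part (illustrative only; the
hashing here is slow and simplistic compared to real implementations):

    import hashlib

    class CountMinSketch:
        """d rows of w counters; each row uses its own hash function."""

        def __init__(self, width=2048, depth=5):
            self.width, self.depth = width, depth
            self.table = [[0] * width for _ in range(depth)]

        def _index(self, item, row):
            digest = hashlib.md5(f"{row}:{item}".encode()).digest()
            return int.from_bytes(digest[:8], "big") % self.width

        def add(self, item, count=1):
            for row in range(self.depth):
                self.table[row][self._index(item, row)] += count

        def estimate(self, item):
            # Collisions only inflate counters, so the minimum over all
            # rows never underestimates the true count.
            return min(self.table[row][self._index(item, row)]
                       for row in range(self.depth))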

------
benstrumental
Here are implementations of some of the sketching algorithms mentioned in the
article for those interested:

[https://github.com/ct-clmsn/CHIUW2017](https://github.com/ct-clmsn/CHIUW2017)

------
andrewxhill
We (set.gl) have been working on a few really exciting ways to do this on-
device, primarily with location data. Here's the talk from this fall at Strata
NY:
[https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/61068](https://conferences.oreilly.com/strata/strata-ny/public/schedule/detail/61068)

------
lambdaphagy
Sorting a dataset in order to draw k samples is completely nuts. Pick k random
integers between 1 and N, rerolling any duplicates.

Is your data a stream? Use reservoir sampling for O(n) vs O(n log n).

Sorting a collection of values in order to sample from it doesn't make any
sense.
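
Both techniques in a few lines of Python, for reference (the reservoir
version is the textbook one-pass method):

    import random

    def sample_without_replacement(n, k):
        """Pick k distinct integers from 1..n, rerolling duplicates."""
        chosen = set()
        while len(chosen) < k:
            chosen.add(random.randint(1, n))
        return list(chosen)

    def reservoir_sample(stream, k):
        """One-pass uniform sample of k items from a stream of unknown length."""
        reservoir = []
        for i, item in enumerate(stream):
            if i < k:
                reservoir.append(item)
            else:
                j = random.randrange(i + 1)
                if j < k:
                    reservoir[j] = item
        return reservoir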

~~~
lorenzhs
The sampling algorithm you describe becomes pretty slow as _k_ grows. There
are much faster algorithms. We wrote a paper on fast sampling:
[https://arxiv.org/pdf/1610.05141.pdf](https://arxiv.org/pdf/1610.05141.pdf)
(page 11; what you described is labelled "H", for _hash table_, in the plot.
Sorry about the single-letter names, it's a weird convention for sampling
algorithms). You can see that as soon as _k_ is more than a few thousand, "R"
(for _recursive_) is much faster. It recursively splits the range (initially
1...N) in half and decides how many samples come from the left half and how
many from the right. It does this by drawing hypergeometrically distributed
numbers, which is the distribution for sampling without replacement. Its base
case is the algorithm you describe (H) once fewer than ~1000 elements remain.
Algorithm R is also very easy to parallelise. ("B_MKL" and "B_GPU" are
vectorised, non-portable implementations of a Bernoulli-sampling-based
algorithm, using Intel MKL and Nvidia CUDA, respectively.)
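
A rough Python sketch of that recursion (my simplification, not the paper's
implementation; it uses numpy's hypergeometric sampler, and the base-case
threshold is arbitrary):

    import random
    import numpy as np

    def sample_recursive(lo, hi, k, base=1000):
        """Sample k distinct integers uniformly from [lo, hi)."""
        n = hi - lo
        if k == 0:
            return []
        if n <= base:
            # Base case "H": hash-set sampling, rerolling duplicates
            chosen = set()
            while len(chosen) < k:
                chosen.add(random.randrange(lo, hi))
            return list(chosen)
        mid = lo + n // 2
        # How many of the k samples fall in the left half?
        left = int(np.random.hypergeometric(mid - lo, hi - mid, k))
        return (sample_recursive(lo, mid, left, base)
                + sample_recursive(mid, hi, k - left, base))

The two recursive calls are independent, which is what makes the algorithm so
easy to parallelise.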

Reservoir sampling can be done much faster than O(n). It's possible in Θ(k
log(N/k)). This is a tight bound, so things like O(k) are impossible, because
every position in the reservoir will be replaced approximately log(N/k) times.
See
[https://www.researchgate.net/profile/Jeffrey_Vitter/publicat...](https://www.researchgate.net/profile/Jeffrey_Vitter/publication/234739284_Random_Sampling_with_a_Reservoir/links/0deec51ac0ca20f508000000.pdf)
for a proof.
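
Vitter's Algorithm Z is the classic way to get there; a later, simpler variant
with the same asymptotics is Li's "Algorithm L", which draws how many elements
to skip rather than flipping a coin per element. A rough Python sketch:

    import math
    import random

    def reservoir_sample_skipping(stream, k):
        """Reservoir sampling that jumps over skipped items (Algorithm L)."""
        it = iter(stream)
        reservoir = [next(it) for _ in range(k)]  # assumes >= k items
        w = math.exp(math.log(random.random()) / k)
        while True:
            skip = math.floor(math.log(random.random()) / math.log(1 - w))
            try:
                for _ in range(skip):   # discard 'skip' items
                    next(it)
                item = next(it)         # the next item enters the sample
            except StopIteration:
                return reservoir
            reservoir[random.randrange(k)] = item
            w *= math.exp(math.log(random.random()) / k)

Note the sub-linear bound only materialises when the input supports cheap
skipping (e.g. an array or a seekable file); over a plain iterator you still
touch every item, but only do O(k log(N/k)) random-number work.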

But you're right that sorting in order to draw a sample is not particularly
efficient most of the time. Not to mention that a selection algorithm would be
sufficient as well, and requires only expected O(N) time:
[https://en.wikipedia.org/wiki/Quickselect](https://en.wikipedia.org/wiki/Quickselect)

