
'Big Data' Has Come to Mean 'Small Sampled Data' - privong
https://www.forbes.com/sites/kalevleetaru/2019/02/17/the-big-data-revolution-will-be-sampled-how-big-data-has-come-to-mean-small-sampled-data/
======
randcraw
I really _like_ this article. It reflects a historical perspective that I find
lacking in most tech media, where the limits of data and method are almost
always overlooked, and it's imagined that the world is entirely different
today than it was in 1990. In fact, the world is very much the same today --
it's the COMPUTERS that are different (and the network). We're just looking at
it through tinted lenses.

(BTW, I've also worked with petabyte-scale social media data, and the author's
suggestion that we often don't ask clear questions or use clearly discriminating
methods is spot on. Too often we imagine that deeper dives into higher
resolution data will yield more and better information. But we very rarely
test that assumption.)

I think the core problem is, we seldom make the effort to understand the
limits of our big data or its sources. And that's because, in the end, nobody
really cares if our datamined conclusions are real or imaginary, just as long
as the graphs look cool.

~~~
grigjd3
I also work at a petabyte-scale web media company, and what I've learned is
we're not selling accuracy. We're enabling ad campaign managers to use
buzzwords with their employers.

------
code4tee
Yes, it usually means “large dataset of mostly junk that someone dumped on us,
which we filtered down to a not-really-that-big dataset and then analyzed
using fairly standard means.”

Although telling people you have a giant big data infrastructure that
performed deep learning to power your AI and cognitive automation sounds a lot
sexier.

------
chubot
Yes, after doing "big data" at Google, my theory is that 95% of the time you
need ONE MapReduce: stratified sampling.

After that, you just do everything on a single beefy multicore box.

Unfortunately most databases do not support efficient stratified sampling,
although it would be trivial to add as far as I can see. It needs to be hooked
into the same place as GROUP BY.

----

As an example, pick 1000 URLs from every domain, 1000 clicks to every URL,
1000 web pages in every language, 1000 people from every country, etc. All of
these things have a "long tail", so you need stratified sampling over the
categories, not just straight sampling.
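
A minimal sketch of that one pass in Python (reservoir sampling run
independently per stratum; the 1000-record cap, record shape, and key function
are illustrative assumptions, not an actual production pipeline):

    import random
    from collections import defaultdict

    def stratified_sample(records, key, per_stratum=1000, seed=0):
        """Reservoir sampling (Algorithm R), run independently per stratum.

        One pass over the data keeps a uniform random sample of up to
        `per_stratum` records for each key -- the same grouping step a
        database performs for GROUP BY, which is why it would be natural
        to hook stratified sampling in there.
        """
        rng = random.Random(seed)
        reservoirs = defaultdict(list)  # stratum key -> sampled records
        seen = defaultdict(int)         # stratum key -> records seen so far
        for rec in records:
            k = key(rec)
            seen[k] += 1
            if len(reservoirs[k]) < per_stratum:
                reservoirs[k].append(rec)
            else:
                j = rng.randrange(seen[k])  # keep rec with prob cap/seen
                if j < per_stratum:
                    reservoirs[k][j] = rec
        return reservoirs

    # e.g. up to 1000 URLs per domain:
    # samples = stratified_sample(url_records, key=lambda r: r["domain"])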

But if you have that, you can answer almost any question with 1 GB of data
IMO.

I think there are two kinds of big data:

1) Production Services like Maps and search, which have to touch every byte

2) Decision making. This can always be done with something like 1 GB of data,
or often 1 MB or 10 KB. You never need 1 TB of data to make a decision about
anything.

~~~
zippy5
I think in general you are right but would like to add a caveat. There is a
trade-off between accuracy and interpretability within any analysis and
usually it makes sense to bias towards the latter. However, it's often worth
verifying that the interpretation extrapolates in a relatively unbiased way.

------
iblaine
"Perhaps the biggest issue is that we need to stop treating “big data” as a
marketing gimmick." says the guy who did a TED talk titled 'What Can Big Data
Do For You'.

Also, comparing IBM PS/2 disk transfer rates to a 10PB cluster in a cloud
makes absolutely no sense.

------
fmajid
Good. I've always thought throwing distributed brute force at a problem is a
sign of laziness. You can get 32-core systems with 128GB of RAM for less than
$10K, and they will have orders of magnitude faster turn-around time than your
average Hadoop or Spark cluster.

~~~
diehunde
I would say the opposite. Vertical scaling is more lazy than horizontal and
also much less scalable.

~~~
CyberDildonics
The point is that it is often not nearly as necessary as it might seem:
instead of doing computations on large chunks of organized data, lots of
computers end up doing latency-bound tasks.

------
monksy
I'm not sure the author fully understands what big data is or where his point
actually lies.

Big data is about processing large amounts of data in a fashion that gets you
results. This is typically where distributed/clustered systems lie, and it
often deals with system optimization.

If you need answers quickly, sampling is done. He does have merit in his
argument that sampling, and approaches to sampling, need to be carefully
considered. But big data can get accurate results on the full population of
the data. (I worked with a company with 35 PB of images; they did get analyzed
via ML and image-processing algorithms. There was very little "sampling"
involved.)

My bone to pick with "big data" is that it tends to be thrown around
ambiguously to justify cookie-cutter systems such as Hadoop/Spark or a large
DB (e.g. Redshift) that don't work for the needs of the business. (Hadoop is
horrible for everything except one-time batch ops or image stitching.)

------
tjpaudio
The author fails to grasp the concept of good enough, or of optimal marginal
return. Subsampling is often done because you can get the level of prediction
desired at a much lower cost, both in terms of hardware and employee hours.
Those projects that can glean value from full data usage certainly go for it,
but always using full data on everything would be a poor decision.

~~~
pixl97
"We randomly sampled swans in the data set, they are all white"

"Then explain this black swan"

~~~
geebee
You do raise a fair objection, but "black swan" events are a known issue when
it comes to sampling.

For example, an answer to your question could be: "you asked us to find the
average wing span to beak size ratio for male and female swans. Including the
black swan doesn't change it at all, and the massively larger sample set
doesn't improve our accuracy".

------
cwyers
Is... is the author judging the change in storage versus compute by comparing
the size of disk drives to the... GHz of processors involved?

> My IBM PS/2 Model 55SX desktop in 1990 had a 16Mhz CPU, 2MB of RAM and a
> 30MB hard drive. That’s roughly one Hz of processor power for every 1.87
> bytes of hard drive space.

Yes, yes he is. I just... that seems like such an absurd comparison to be
making.

~~~
romwell
You know, Big O notation is all about relating _number of operations_ to
_input size_. GHz is, roughly, proportional to _number of operations_ in a
fixed time frame (let's not worry about memory bandwidth for a second), and
HDD size is proportional to _input size_, so this comparison makes sense.

See it this way: for an O(n) algorithm to continue being usable, the GHz/GB
ratio needs to stay constant over time.

~~~
cwyers
Except CPUs hit 3GHz back in 2002 and have, in terms of what most of us use,
stayed there. The growth in CPU clock speed has not kept pace with the
improvements in CPU performance that have come from superscalar designs and
specialized instruction sets. On the timeframes being discussed, a clock-only
comparison will underrate how much CPUs have changed since then, even on a
per-core basis.

~~~
snovv_crash
We've started doubling, doubling, and doubling again the cores and memory
bandwidth, though, so our dataset-size-to-(cores × GHz) ratio would still be
comparable. The only thing this doesn't take into account is IPC.

~~~
cwyers
> Fast forward almost 30 years to 2019. A typical commercial cloud VM “core”
> is a hyperthread of a 2.5Ghz to 3.5Ghz (turbo) physical processor core with
> 2GB-14GB RAM per core and effectively infinite storage via cloud storage,
> though local disk might be limited to 64TB or less. Using the same CPU-to-
> disk ratio as 1990, that 3.5Ghz turbo hyperthread would be paired to 6.6GB
> of disk (actually, 3.3GB of disk, taking into account the fact that a
> hyperthread is really only half of a physical core).

The author "controls" for the increase in cores, though.

This really is a pretty simple point. A modern x86-64 CPU is not just a 386SX
running at a higher clock speed. You cannot compare CPUs across decades only
on clock speeds. Nothing works like that.

~~~
romwell
>Nothing works like that.

Sure. Throw in another _order of magnitude_ to account for that. Heck, throw
in two! Why not say that a single 4GHz CPU core today has the processing power
of 100 4GHz CPUs of yore.

You're still _orders of magnitude behind_ the growth of disk space.

And that only makes sense for _very simple and fast_ algorithms. Anything
super-linear is still very hard with the amounts of data we are able to
collect. Which is the author's point.
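
For concreteness, here's the arithmetic with the article's own figures (the
100x factor is the per-clock generosity hypothesized above, not a measured
number):

    # Hz of CPU per byte of disk, using the article's quoted figures.
    ratio_1990 = 16e6 / 30e6            # 16 MHz / 30 MB  ~= 0.53 Hz/byte
    ratio_2019 = 3.5e9 / 64e12          # 3.5 GHz hyperthread / 64 TB disk
    ratio_kind = (100 * 3.5e9) / 64e12  # grant 100x more work per clock

    print(ratio_1990 / ratio_2019)  # ~9,750x less CPU per byte than in 1990
    print(ratio_1990 / ratio_kind)  # ~97x: still two orders of magnitude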

------
buryat
As someone who works in the field, I agree with the general premise: first
everyone tried to collect all the data, and now they're trying to figure out
what to do with it. However, his supporting arguments are laughable,
especially where the author compares bytes per Hz. A typical low-quality piece
from Forbes.

------
dredmorbius
I'd just commented on this at HN a few days ago[1], noting both that a fair
amount of "big data" is actually surprisingly small, and that sampling-based
methods are very, very, very often _good enough_ while tremendously reducing
costs and complexity. There are other factors, some of which favour the
smaller-is-better approach, some of which favour more comprehensive
approaches.

Background: though my pseudonymity precludes giving specifics, I've worked in
and around data analytics for much of three decades, across various fields.
What was "big data" at the onset of my career -- when multiple departments of
over thirty analysts shared access to a few GB of storage distributed over
multiple cluster systems -- is now something I can trivially tackle on a
modest, decade-and-a-half-old desktop system. Newer boxen, with yet more
memory and SSD or hybrid drives, are even more capable.

Under my present 'nym, I've done analysis especially of Google+ since 2015
(largely out of personal interest and realising methods _were_ available to
me). Much of that has been based on sampling methods, some has relied on more
comprehensive analysis, though again, with an eye to limiting total data
processing.

In the case of Google+, the questions of how many actively-posting users and
how many active Communities exist have come up. Sampling methods have been
_reasonably_ useful in assessing these.

Google's sitemaps files (from
[https://plus.google.com/robots.txt](https://plus.google.com/robots.txt))
supplied _total_ counts of profiles, communities, and several other
categories. This itself takes some work -- there are 50,000 Profile sitemap
files, and 100 Communities files, containing 3.4 billion and 8.1 million
entries, respectively. Even downloading _just_ the Profile sitemaps itself is
roughly 37 GB of data.

For a _rough_ estimate of active profiles, _it only took about 100 randomly
sampled observations to come up with a pretty solid value:_ about 9% of
profiles had posted publicly ever (a ratio which remains true). _Finer-level_
detail
for more frequently posting accounts takes a far larger sample _because those
highly-active profiles are so much rarer_. I'd done my first analysis based on
about 50,000 records, while Stone Temple Consulting independently confirmed
and extended my analysis to find an estimate of profiles posting 100+ times
per month (about 50,000 across the full site). That is, 0.0015625% of all G+
profiles extant at the time (about 3.2 billion).
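
As a back-of-the-envelope illustration of why ~100 observations suffice for
the rough rate but not for the rare tail (a crude normal-approximation sketch,
not necessarily the estimator actually used):

    import math

    def margin_95(p, n):
        """Normal-approximation 95% margin of error for a sample proportion."""
        return 1.96 * math.sqrt(p * (1 - p) / n)

    # ~9% of profiles had ever posted publicly: 100 samples get close enough.
    print(margin_95(0.09, 100))          # ~0.056 -> 9% +/- 5.6 points

    # ~0.0015625% post 100+ times/month: the tail needs far larger samples.
    # (The normal approximation is shaky at such small p, but the scale holds.)
    print(margin_95(1.5625e-5, 50_000))  # ~3.5e-5, larger than the rate itself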

When I first tackled the G+ Communities question, I was trying to get a sense
of:

1. How many there were.

2. What typical membership was.

3. How many were "reasonably active", based on ... somewhat contrived
measures.

I adopted a sampling approach -- it was easy to pull a full _listing_ of
Communities, but web-scraping these would take time (at about 1.5s per URL).
So my first approach utilised a 12,000 record sample. That gave a good overall
view, but turned out to be thin on the very largest communities, so I ran a
second pull based on 36,000 records, also randomly sampled. (A test of
sampling: how many communities should appear in both samples? The expected and
actual results matched: 53.)
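
That expected overlap is easy to reproduce, assuming both pulls were uniform
over the full listing:

    # Each of the n1 first-sample communities has probability n2/N of also
    # appearing in the second sample, so the expected overlap is n1*n2/N.
    N, n1, n2 = 8_100_000, 12_000, 36_000
    print(n1 * n2 / N)  # 53.3... -- matching the 53 actually observed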

Even then, representation from the very largest communities was thin.
Fortunately, I found someone able to sample all 8.1 million communities
rapidly, and from this compiled a list of 105,000 communities, with 100+
members _and_ posting activity within the 31 previous days (based on Jan 5-6,
2019).

Note that this highlights another aspect of media: of _all 8.1 million_
communities, only 1.3% fell into the active list. That is, _media and
community dynamics tend to focus most activity on a vanishingly small subset
of total members, groups, posts, or other entities_. In other words, attention
is highly rivalrous, and is _exceedingly_ unequally distributed.

The challenge is in finding the active entities. Given my approach, via
sitemaps, this was difficult (though a few options may have been available).

The upshot is that in the case of both users and communities, the _really
active_ and _interesting_ set is, by modern standards, _small data_. There are
50,000 to a few million active users, and a few thousand (far fewer than
100,000) truly active communities. Even the total post volume of G+
communities is, by modern standards, strikingly small -- from January 2013 to
January 2019, about 300 GB of post text was submitted. Image content and the
surrounding Web payload (800 kiB per post) inflate that considerably, but yes,
_over half a decade_ of a reasonably large social media platform's community
discussions could sit comfortably on most modern hard drives, or even much
mobile storage.

I'll skip over the questions of truly social analysis of such data, with
connections between users and other entities, though I'll note that _even for
very large systems_ , managing these at scale is difficult and the results may
not be particularly usefully interpreted. Combinatorial maths remains
challenging, even with beefy iron.

________________________________

Notes:

1. See:
[https://news.ycombinator.com/item?id=19294024](https://news.ycombinator.com/item?id=19294024)
and
[https://news.ycombinator.com/item?id=19294004](https://news.ycombinator.com/item?id=19294004),
on the often surprisingly _small_ scale of "big data" and the value of
sampling, respectively.

------
kemitchell
Hey! It means something now!

