'Big Data' Has Come to Mean 'Small Sampled Data' (forbes.com/sites/kalevleetaru)
129 points by privong on March 6, 2019 | 37 comments


I really like this article. It reflects a historical perspective that I find lacking in most tech media, where the limits of data and method are almost always overlooked, and it's imagined that the world is entirely different today than it was in 1990. In fact, the world is very much the same today -- it's the COMPUTERS that are different (and the network). We're just looking at it through glasses with tinted lenses.

(BTW, I've also worked with petabyte-scale social media data, and the author's suggestion that we often don't ask clear questions or use clearly discriminating methods is spot on. Too often we imagine that deeper dives into higher resolution data will yield more and better information. But we very rarely test that assumption.)

I think the core problem is that we seldom make the effort to understand the limits of our big data or its sources. And that's because, in the end, nobody really cares whether our datamined conclusions are real or imaginary, just as long as the graphs look cool.


I also work at a petabyte-scale web media company, and what I've learned is that we're not selling accuracy. We're letting ad campaign managers use buzzwords with their employers.


So much this. It's why data engineering ends up being more and more useful, even if it's not sexy. Automating the boring parts of dumping all the ugly data and cleaning up the stuff that is sampled is a better use of resources.


Yes, it usually means “large dataset of mostly junk that someone dumped on us, which we filtered down to a not-really-that-big dataset and then analyzed using fairly standard means.”

Although telling people you have giant big data infrastructure that performed deep learning to power your AI and cognitive automation sounds a lot sexier.


Yes, after doing "big data" at Google, my theory is that 95% of the time you need ONE MapReduce: stratified sampling.

After that, you just do everything on a single beefy multicore box.

Unfortunately most databases do not support efficient stratified sampling, although it would be trivial to add as far as I can see. It needs to be hooked into the same place as GROUP BY.

----

As an example, pick 1000 URLs from every domain, 1000 clicks to every URL, 1000 web pages in every language, 1000 people from every country, etc. All of these things have a "long tail", so you need stratified sampling over the categories, not just straight sampling.

But if you have that, you can answer almost any question with 1 GB of data IMO.
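
Not the commenter's actual MapReduce, but a minimal sketch of the idea in pandas; the input file and the `domain` column are hypothetical stand-ins for whatever the strata are, and the SQL analogue is a ROW_NUMBER() OVER (PARTITION BY domain ORDER BY random()) <= 1000 filter:

    # Minimal sketch of stratified sampling in pandas (illustrative only).
    import pandas as pd

    df = pd.read_parquet("clicks.parquet")   # assumed input

    # Take up to 1000 rows per domain, so the long tail is represented
    # instead of being drowned out by a straight uniform sample.
    sample = (
        df.groupby("domain", group_keys=False)
          .apply(lambda g: g.sample(n=min(len(g), 1000), random_state=0))
    )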

I think there are two kinds of big data:

1) Production Services like Maps and search, which have to touch every byte

2) Decision making. This can always be done with something like 1 GB of data, or often 1 MB or 10 KB. You never need 1 TB of data to make a decision about anything.


I think in general you are right, but I would like to add a caveat. There is a trade-off between accuracy and interpretability within any analysis, and it usually makes sense to bias towards the latter. However, it's often worth verifying that the interpretation extrapolates in a relatively unbiased way.


"Perhaps the biggest issue is that we need to stop treating “big data” as a marketing gimmick." says the guy who did a TED talk titled 'What Can Big Data Do For You'.

Also, comparing IBM PS/2 disk transfer rates to a 10PB cluster in a cloud makes absolutely no sense.


Good. I've always thought throwing distributed brute force at a problem is a sign of laziness. You can get 32-core systems with 128GB of RAM for less than $10K, and they will have orders of magnitude faster turn-around time than your average Hadoop or Spark cluster.


Exactly. Compare the inter-core bandwidth and latency of a beefy multicore system against those of a distributed system, then use that ratio (for whatever your bottleneck is) to work out how much bigger the comparable distributed system would have to be. Usually people assume they are compute bound when in reality they are memory latency/bandwidth bound, and they neglect to do this comparison when speccing out a distributed system.
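
A back-of-the-envelope version of that comparison (all figures below are illustrative assumptions, not measurements):

    # Rough sketch of the single-box vs. cluster comparison described above.
    dram_bw_gb_s    = 100.0   # assumed memory bandwidth of one beefy box, GB/s
    network_bw_gb_s = 1.25    # assumed per-node network bandwidth (10 GbE), GB/s

    # If the job is memory-bandwidth bound, roughly this many cluster nodes
    # are needed just to match the single box on the bottleneck resource.
    print(dram_bw_gb_s / network_bw_gb_s)   # ~80 nodes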


I would say the opposite. Vertical scaling is more lazy than horizontal and also much less scalable.


The point is that it is often not nearly as necessary as it might seem, because instead of doing computations on large chunks of organized data, lots of computers end up doing latency bound tasks.


I think you're missing the point the article is making, or at least obstinately not addressing it.


What kind of problems have you worked with that you compared against Hadoop/Spark?


I'm not sure the author fully understands what big data is and where his point actually lies.

Big data is about processing large amounts of data in a way that gets you results. This is typically where distributed/clustered systems come in, and it often deals with system optimization.

If you need answers quickly enough, sampling is done. Now, he does have merit in his argument that sampling, and the approach taken to it, needs to be considered carefully. But big data can get accurate results on the full population of the data. (I worked with a company with 35 PB of images; they were analyzed via ML and image-processing algorithms. There was very little "sampling" involved.)

My bone to pick with "big data" is that the term tends to be thrown around ambiguously to justify cookie-cutter systems such as Hadoop/Spark or a large DB (e.g. Redshift) that don't fit the needs of the business. (Hadoop is horrible for everything except one-time batch ops, or image stitching.)


That author fails to grasp the concept of good enough, or optimal marginal return. Subsampling is often done because you can get the level of prediction desired at a much lower cost, both in terms of hardware and employee hours. Those projects that can glean value from full data usage certainly go for it, but always using full data on everything would be a poor decision.


> Subsampling is often done because you can get the level of prediction desired at a much lower cost, both in terms of hardware and employee hours.

While that's valid, I think the author's concern is that authors frequently do not demonstrate that the subsampling is representative, so the conclusions drawn from the sampled data may not be as accurate as claimed.

(edit: minor phrasing change)


If only there was an entire field that exists to characterize when subsampling a population is valid and sound.


If only my non-statistical peers would recognize that sampled=fast and fast=more checks and explorations. It's like they recognize that fast compile times are a great thing (I've had to wait in line with punchcards, and it sucks), but they are completely oblivious to the exact same argument when it comes to data.


Why spend years learning those results when you can ignore them at no cost?

And have a happy client that finally gets the results he wanted when the other analysts said it was impossible.


Elsewhere he talks about how subsamples are chosen because they take but seconds to run instead of minutes. If a sample is small enough that you're saving that much time, and analysis is that cheap, you can still do five-fold or ten-fold cross-validation in less time than the full-data-set analysis and get a very good idea of whether your subsample is representative of the data.
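
A minimal sketch of that kind of check, assuming a pandas DataFrame `df` with a hypothetical `value` column whose mean is the statistic of interest:

    # Split the full data into k folds, compute the statistic on each fold,
    # and see whether the fold-level estimates agree with each other
    # (a small spread suggests a subsample that size is representative).
    import numpy as np
    import pandas as pd

    def fold_estimates(df: pd.DataFrame, k: int = 10, seed: int = 0):
        idx = np.random.default_rng(seed).permutation(len(df))
        return [df.iloc[fold]["value"].mean() for fold in np.array_split(idx, k)]

    # ests = fold_estimates(df)
    # print(np.mean(ests), np.std(ests))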


That's what I took from it. Forbes is aimed at the people buying solutions so from the perspective of a CIO or CTO (or even the people they're supporting with their analysis systems), the article is telling them that they may not be getting what they think they paid for.

This is about industry trends, and you CAN get a representative sample in reasonable time, for some definition of reasonable. The takeaway from this article is that what someone in the data space thinks is reasonable isn't what seems reasonable to someone who just paid for an army of data scientists and a data lake solution because those things are sexy.


"We randomly sampled swans in the data set, they are all white"

"Then explain this black swan"


You do raise a fair objection, but "black swan" events are a known issue when it comes to sampling.

For example, an answer to your question could be: "you asked us to find the average wing span to beak size ratio for male and female swans. Including the black swan doesn't change it at all, and the massively larger sample set doesn't improve our accuracy".


If you are doing things right you never say the equivalent of "they're all white" - you give a distribution. Explaining that and what it means is a communications issue, not a data one.


>"Then explain this black swan"

It's not black, it is a very dark white, and you might be looking at it with the wrong lighting.


I didn't get that sense. I think he 1) does understand the value of sampling but 2) is highlighting the false advertising that's often practiced of claiming Big Data based results when what's actually presented is in fact sampling.


Is... is the author judging the change in storage versus compute by comparing the size of disk drives to the... Ghz of processors involved?

> My IBM PS/2 Model 55SX desktop in 1990 had a 16Mhz CPU, 2MB of RAM and a 30MB hard drive. That’s roughly one Hz of processor power for every 1.87 bytes of hard drive space.

Yes, yes he is. I just... that seems like such an absurd comparison to be making.


You know, Big O notation is all about relating the number of operations to input size. GHz is roughly proportional to the number of operations in a fixed time frame (let's not worry about memory bandwidth for a second), and HDD size is proportional to input size, so this comparison makes sense.

See it this way: for an O(n) algorithm to continue being usable, the GHz/GB ratio needs to stay constant over time.
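
Working through the article's own figures (just reproducing its arithmetic):

    # The ratio the article computes, from its own numbers.
    hz_1990, disk_1990 = 16e6, 30e6              # 16 MHz CPU, 30 MB disk
    print(disk_1990 / hz_1990)                   # ~1.87 bytes of disk per Hz

    hz_2019 = 3.5e9                              # 3.5 GHz turbo hyperthread
    print(hz_2019 * (disk_1990 / hz_1990) / 1e9) # ~6.6 GB of disk at the 1990 ratio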


Except CPUs hit 3 GHz back in 2002 and have, in terms of what most of us use, stayed there. The growth in CPU clock speed has not kept pace with the improvements in CPU performance that have come from superscalar designs and specialized instruction sets. On the timeframes being discussed, it's a comparison that will underrate how much CPUs have changed since then, even on a per-core basis.


We've been doubling, and doubling again, the cores and memory bandwidth, though, so dataset size relative to cores*GHz would still be comparable. The only thing this doesn't take into account is IPC.


> Fast forward almost 30 years to 2019. A typical commercial cloud VM “core” is a hyperthread of a 2.5Ghz to 3.5Ghz (turbo) physical processor core with 2GB-14GB RAM per core and effectively infinite storage via cloud storage, though local disk might be limited to 64TB or less. Using the same CPU-to-disk ratio as 1990, that 3.5Ghz turbo hyperthread would be paired to 6.6GB of disk (actually, 3.3GB of disk, taking into account the fact that a hyperthread is really only half of a physical core).

The author "controls" for the increase in cores, though.

This really is a pretty simple point. A modern x86-64 CPU is not just a 386SX running at a higher clock speed. You cannot compare CPUs across decades only on clock speeds. Nothing works like that.


>Nothing works like that.

Sure. Throw in another order of magnitude to account for that. Heck, throw in two! Why not say that a single 4 GHz CPU core today has the processing power of 100 4 GHz CPUs of yore.

You're still orders of magnitude behind the growth of disk space.

And that only makes sense for very simple and fast algorithms. Anything super-linear is still very hard with the amounts of data we are able to collect. Which is the author's point.


Independent of the usage within the article, I make this kind of comparison pretty often when thinking about how to "saturate" a device: if I am building a specialized system to perform a task, I want the capability of the system, in terms of Hz per byte of storage, to roughly match that of the problem being performed. (This is typically measured against storage bandwidth, not total storage, since you usually assume you're bringing data in from a larger storage pool that is essentially "infinite" in comparison: cache <---> main memory, main memory <---> disk, or disk <---> distant server.)

If your Hz/byte ratio is too high, you're wasting compute capability. If too low, you're wasting memory/memory bandwidth.
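
A toy version of that sizing check; all numbers here are illustrative assumptions:

    # Hz-per-byte sizing sketch: does compute or storage sit idle?
    core_hz         = 3.5e9   # assumed clock of one core
    cycles_per_byte = 4       # assumed work the algorithm spends per input byte
    storage_bw      = 2e9     # assumed bytes/s delivered by the next storage tier

    consume_rate = core_hz / cycles_per_byte   # bytes/s one core can chew through
    print("storage-bound" if consume_rate > storage_bw else "compute-bound")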


That actually is a very useful ratio to look at. Essentially says “in a compute-bound data processing algorithm, how long would it take to process all the data you can store?”


As someone who works in the field, I agree with the general premise: first everyone tried to collect all the data, and now they're trying to figure out what to do with it. However, the supporting arguments are laughable, especially where the author compares bytes per Hz. A typical low-quality piece from Forbes.


I'd just commented on this at HN a few days ago[1], noting both that a fair amount of "big data" is actually surprisingly small, and that sampling-based methods are very, very, very often not only good enough, but tremendously reduce costs and complexity. There are other factors, some of which favour the smaller-is-better approach, some of which favour more comprehensive approaches.

Background: though my pseudonymity precludes giving specifics, I've worked in and around data analytics for much of three decades, across various fields. What was "big data" at the onset of my career -- when multiple departments of over thirty analysts shared access to a few GB of storage distributed over multiple cluster systems -- is now something I can trivially tackle on a modest and decade-and-a-half-old desktop system. Newer boxen, with yet more memory and SSD or hybrid drives are even more capable.

Under my present 'nym, I've done analysis especially of Google+ since 2015 (largely out of personal interest and realising methods were available to me). Much of that has been based on sampling methods, some has relied on more comprehensive analysis, though again, with an eye to limiting total data processing.

In the case of Google+, the questions of how many actively-posting users and how many active Communities exist have come up. Sampling methods have been reasonably useful in assessing these.

Google's sitemaps files (from https://plus.google.com/robots.txt) supplied total counts of profiles, communities, and several other categories. This itself takes some work -- there are 50,000 Profile sitemap files, and 100 Communities files, containing 3.4 billion and 8.1 million entries, respectively. Even downloading just the Profile sitemaps itself is roughly 37 GB of data.

For a rough estimate of active profiles, it only took about 100 randomly sampled observations to come up with pretty solid value: about 9% of profiles had posted publicly ever (a ratio which remains true). Finer-level detail for more frequently posting accounts takes a far larger sample because those highly-active profiles are so much rarer. I'd done my first analysis based on about 50,000 records, while Stone Temple Consulting independently confirmed and extended my analysis to find an estimate of profiles posting 100+ times per month (about 50,000 across the full site). That is, 0.0015625% of all G+ profiles extant at the time (about 3.2 billion).
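
The standard error of a sampled proportion shows why ~100 observations suffices for the 9% figure but nowhere near suffices for the rarest tier (a generic binomial calculation, nothing specific to the G+ data):

    # Standard error of a sampled proportion: sqrt(p * (1 - p) / n).
    import math

    def se(p, n):
        return math.sqrt(p * (1 - p) / n)

    print(se(0.09, 100))          # ~0.029: 9% +/- ~3 points from only 100 profiles
    print(se(1.5625e-5, 100))     # ~4e-4: error swamps the 100+/month rate itself
    print(se(1.5625e-5, 50_000))  # ~1.8e-5: still about the size of the rate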

When I first tackled the G+ Communities question, I was trying to get a sense of:

1. How many there were.

2. What typical membership was.

3. How many were "reasonably active", based on ... somewhat contrived measures.

I adopted a sampling approach -- it was easy to pull a full listing of Communities, but web-scraping them would take time (at about 1.5 s per URL). So my first approach used a 12,000 record sample. That gave a good overall view, but turned out to be thin on the very largest communities, so I ran a second pull based on 36,000 records, also randomly sampled. (A test of the sampling: how many communities should appear in both samples? The expected and observed counts matched: 53.)
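
The expected overlap is simple to verify from the counts above, since for two independent random samples from the same pool it is just n1 * n2 / N:

    # Expected overlap of two independent random samples drawn from N items.
    n1, n2, N = 12_000, 36_000, 8_100_000
    print(n1 * n2 / N)   # ~53.3, matching the 53 observed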

Even then, representation from the very largest communities was thin. Fortunately, I found someone able to sample all 8.1 million communities rapidly, and from this compiled a list of 105,000 communities, with 100+ members and posting activity within the 31 previous days (based on Jan 5-6, 2019).

Note that this highlights another aspect of media: of all 8.1 million communities, only 1.3% fell into the active list. Media and community platforms tend to focus most ACTIVITY on a VANISHINGLY small subset of total members, groups, posts, or other entities. That is, attention is highly rivalrous, and is exceedingly unequally distributed.

The challenge is in finding the active entities. Given my approach, via sitemaps, this was difficult (though a few options may have been available).

The upshot is that in the case of both users and communities, the really active and interesting set is, by modern standards, small data. There are 50,000 to a few million active users, and a few thousand (and far fewer than 100,000) truly active communities. Even the total post volume of G+ communities is, by modern standards, strikingly small -- from January 2013 to January 2019, about 300 GB of post text was submitted. Image content and the surrounding Web payload (800 kiB per post) inflate that considerably, but yes, over half a decade of a reasonably large social media platform's community discussions could sit comfortably on most modern hard drives, or even on much mobile storage.

I'll skip over the questions of truly social analysis of such data, with connections between users and other entities, though I'll note that even for very large systems, managing these at scale is difficult and the results may not be particularly usefully interpreted. Combinatorial maths remains challenging, even with beefy iron.

________________________________

Notes:

1. See: https://news.ycombinator.com/item?id=19294024 and https://news.ycombinator.com/item?id=19294004, on the often surprisingly small scale of "big data" and the value of sampling, respectively.


Hey! It means something now!</titlesnipe>



