
My experience with "Big Data" is pretty dated, at least five years old at this point. Back then I think a good cutoff for "big data" might have been around a petabyte, plus or minus a factor of 10 depending on your gear. I imagine now even 1PB is probably pretty mild by "big data" standards.

But once you're up in that "I can't even fit this in a 4-8U sled" territory (whatever that threshold is in a given decade), you're probably doing some kind of map/reduce thing, so there's a strong incentive to have a column-major layout. If you can periodically sort by some important column, so much the better (O(log n) binary search). But mostly you've got a bunch of mappers (which you work hard to place for locality relative to the DFS replicas where the disks live, maybe on the same machine, maybe behind the same top-of-rack switch or whatever) zipping through different columns or column sets and producing eligible conceptual "rows" that feed your shuffle/sort/reduce pipeline to deal with joins and sorts and stuff like that.
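To make that shape concrete, here's a toy, single-process sketch of the map -> shuffle/sort -> reduce flow in plain Python. The column names and data are invented, and a real job would of course spread the map and reduce work across the cluster rather than running in one loop:

    # Toy sketch of the map -> shuffle/sort -> reduce shape described above,
    # in plain Python rather than a real Hadoop job. Columns and data are
    # made up for illustration.
    from collections import defaultdict

    # Column-major storage: each column is its own list (on a real cluster,
    # its own set of files/blocks spread across DFS replicas).
    columns = {
        "user_id": [1, 2, 1, 3, 2],
        "bytes":   [100, 250, 50, 300, 25],
    }

    def mapper(user_ids, byte_counts):
        # Zip through just the columns this job needs, emitting (key, value) pairs.
        for uid, b in zip(user_ids, byte_counts):
            yield uid, b

    def shuffle(pairs):
        # Stand-in for the shuffle/sort phase: group values by key.
        groups = defaultdict(list)
        for k, v in pairs:
            groups[k].append(v)
        return groups

    def reducer(key, values):
        # One reduce call per key; here, a simple aggregation.
        return key, sum(values)

    grouped = shuffle(mapper(columns["user_id"], columns["bytes"]))
    results = [reducer(k, vs) for k, vs in grouped.items()]
    print(sorted(results))  # [(1, 150), (2, 275), (3, 300)]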

I don't know how Google does it, but I think almost everyone else started with something like the Hadoop ecosystem, and many layered something like Hive/HQL on top to get a SQL-like way to express those jobs, especially ad-hoc queries (long-lived, rarely-changing overnight jobs might get hand-optimized into some lower-level representation).
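As a rough illustration of what "a SQL-like way to express that job" means, here's the same aggregation as the sketch above written declaratively. I'm showing it through PySpark's SQL interface rather than Hive proper, but the HiveQL text for this would read essentially the same; the table and column names ("access_logs", "user_id", "bytes") are invented:

    # The toy aggregation expressed declaratively; the engine works out the
    # scan -> shuffle -> aggregate plan for you. Shown via PySpark's SQL
    # interface, with invented table/column names.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hql-style-example").getOrCreate()

    rows = [(1, 100), (2, 250), (1, 50), (3, 300), (2, 25)]
    logs = spark.createDataFrame(rows, ["user_id", "bytes"])
    logs.createOrReplaceTempView("access_logs")

    totals = spark.sql("""
        SELECT user_id, SUM(bytes) AS total_bytes
        FROM access_logs
        GROUP BY user_id
        ORDER BY total_bytes DESC
    """)
    totals.show()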

Around the time I was getting out of that game, Spark was starting to get really big, thanks to some combination of RAM getting really abundant and a general re-think of what was by then a pretty old cost model. I have no idea what people are doing now.
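For what it's worth, the "RAM got abundant" part shows up directly in Spark's API: you can pin an intermediate dataset in memory and reuse it across several computations instead of re-reading it from disk each time, which is exactly the old MapReduce cost model being revisited. A minimal sketch, reusing the invented names from above:

    # Sketch of the in-memory reuse that made Spark attractive: cache an
    # intermediate DataFrame once, then run several computations over it
    # without going back to the source data. Names are invented.
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spark-cache-example").getOrCreate()
    rows = [(1, 100), (2, 250), (1, 50), (3, 300), (2, 25)]
    logs = spark.createDataFrame(rows, ["user_id", "bytes"])

    # Pin the filtered intermediate in cluster memory; the later jobs reuse it
    # rather than re-scanning the input.
    filtered = logs.where(F.col("bytes") > 30).cache()

    per_user = filtered.groupBy("user_id").agg(F.sum("bytes").alias("total_bytes"))
    heavy_hitters = filtered.where(F.col("bytes") > 200)

    per_user.show()
    heavy_hitters.show()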

I'd love it if someone with up-to-date knowledge about how this stuff works these days chimed in.



