One time I met a company that insisted they were sending tens of TB of data per day and would need multiple PB of compressed storage per year. Took one look at the data: all JSON, all GUIDs and bools. If we just pre-parse it, the entire dataset for a year fits in a few hundred GB uncompressed -- it could literally fit on a MacBook Air for most of the year.

The funny thing about "big data," in my experience, is just how small it actually becomes when you start using the right tools. And yet so much energy goes into just getting the wrong tools to do more...
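For illustration, a minimal sketch of what that kind of pre-parsing can look like, assuming records shaped roughly like {"id": "<guid>", "flag": true} (the field names and the 16-byte-UUID-plus-1-byte-bool layout are my assumptions, not the actual schema):

    import json
    import struct
    import uuid

    def pack_record(line: str) -> bytes:
        # Turn one JSON line into 17 bytes: 16-byte UUID + 1-byte bool.
        rec = json.loads(line)
        guid = uuid.UUID(rec["id"]).bytes       # 16 bytes instead of a 36-char string
        flag = struct.pack("?", rec["flag"])    # 1 byte instead of "true"/"false"
        return guid + flag

    def unpack_record(blob: bytes) -> dict:
        # Inverse of pack_record, for reading the compact form back.
        return {
            "id": str(uuid.UUID(bytes=blob[:16])),
            "flag": struct.unpack("?", blob[16:17])[0],
        }

A raw JSON line like that is 60+ bytes of text; packed, it's 17 bytes per record, before compression even enters the picture.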

> The funny thing about "big data," in my experience, is just how small it actually becomes when you start using the right tools.

Rings way too true for me atm.

My current workplace is struggling because one of our applications stores something like a combined 300 GB of analytics data in the same database as the application data. Modifying the table causes hours of downtime, because everyone claims that backwards-compatible DB changes are too hard. And everyone is scared because with more users there's "so much more analytics data" incoming. Yes, with 300 GB across 3-4 years.

And I'm just wondering why it's not an option to move all of that into one decently sized MySQL/Postgres instance. Give it SSDs and 30-60 GB of RAM for the hot dataset (1-2 months) and it would just solve our problems. But apparently "that's too hard to do and takes too much time", with no further reasoning given.
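Back-of-envelope on those numbers, assuming the data accumulated roughly evenly over the period:

    total_gb = 300                        # ~300 GB of analytics data
    months = 42                           # ~3.5 years
    per_month_gb = total_gb / months      # ~7 GB of new data per month
    hot_set_gb = 2 * per_month_gb         # "hot" 1-2 months: ~14 GB
    print(per_month_gb, hot_set_gb)       # -> ~7.1, ~14.3

So a box with 30-60 GB of RAM keeps the entire hot set cached with room to spare, which is why the single-instance option looks so reasonable.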


All of the source text for Google+ Communities posts comes to a few hundred GB. That's for 8.1 million communities and ~10 million active users.

Add in images and the rest of the Web payload (800 KiB per page), and that swells into the petabyte range. But the actual scale of the user-entered text is stunningly small.
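Rough per-user math from those figures, taking "a few hundred GB" as ~300 GB (that exact number is my assumption):

    text_bytes = 300e9             # "a few hundred GB" of source text
    users = 10e6                   # ~10 million active users
    page_payload_kib = 800         # full Web payload per page

    per_user_kb = text_bytes / users / 1e3    # ~30 KB of text per active user
    ratio = per_user_kb / page_payload_kib    # ~0.04: one user's text is ~4% of one page load
    print(per_user_kb, ratio)

In other words, a single page load ships more bytes than the average user ever typed.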
