
Fine print: But it might cost you $300K CapEx or $800K/yr OpEx. Hope you have a budget!

You know what else costs? A humongous number of servers running silly stuff to orchestrate other silly stuff to autoscale yet more silly stuff, all to do things with data that could fit into memory and be processed on a single server (plus a backup, of course).

Add to that a small army of people, because, you know, you need specialists from a variety of professions just to debug all the integration issues between all those components that WOULD NOT BE NEEDED if you had just decided to put your stuff in memory.

Frankly, the proportion of projects that genuinely need to work on data that cannot fit in the memory of a single machine is very low. I work for one of the largest banks in the world, processing most of its trades from all over the world, and guess what: all of it fits in RAM.

There's a world outside of web apps and SV tech companies. There are a lot of big datasets out there, most of which never hit the cloud at all.

Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc.; about 100k sensors) would fill a 40TB array every year. They had a 20-year retention policy. So Dell would just add a new server + array every year.
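For a sense of scale, a rough back-of-the-envelope (assuming the 40TB/year is all raw sensor samples):

```python
sensors = 100_000
bytes_per_year = 40e12  # the 40TB array filled each year

per_sensor_year = bytes_per_year / sensors        # bytes per sensor per year
per_sensor_sec = per_sensor_year / (365 * 86400)  # bytes per sensor per second

print(f"{per_sensor_year / 1e6:.0f} MB per sensor per year")  # 400 MB
print(f"{per_sensor_sec:.1f} bytes per sensor per second")    # ~12.7 B/s
```

So each sensor only needs to emit a dozen or so bytes per second for those numbers to add up; any single year fits on one array, and it's the 20-year retention that forces the pile of hardware.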

I worked with another company that had 2,000 Oracle servers in some sort of franken-cluster config. Reports took a week to run, and they had pricing data for their industry (they were a transaction middleman) going back almost 40 years. I can't even guess the data size, because nobody could figure it out.

This is not a FAANG problem. This is an average SME-to-large-enterprise problem. Yeah, startups don't have much data. Most companies out there aren't startups.

By the way, memory isn't the only solution. In the past 15 years, I've rarely worked on projects where everything was in memory. Disks work just fine with good database technology.

> Story time: I worked on one project where a single (large) building's internal sensor data (HVAC, motion, etc.; about 100k sensors) would fill a 40TB array every year. They had a 20-year retention policy. So Dell would just add a new server + array every year.

That's a lot of data, but what do you even do with it other than take minuscule slices or calculate statistics?

And for those uses, I'd put whether it fits in RAM as not applicable. It doesn't, but can you even tell the difference?

They paid us $600K every six months to analyze the data and suggest adjustments to their control systems (it's called continuous commissioning, though it's not really continuous, due to laws in many places requiring a person in the loop on controls). They saved millions of dollars every year doing this, because large, complex buildings drift out of optimized airflow and electricity use very quickly.

Agreed that 20-year retention is silly. We thought it was silly too, but the policies reflected the need for historical analysis for audit purposes.

It does in fact matter what you can fit in RAM, though. We had to adapt all our systems to a janky SQL Server setup that was horrible for time-series data, and make our software run on those servers. RAM availability for working sets was a huge bottleneck (hence the cost of the analysis).

This is another problem: is a 20-year retention policy really necessary for ALL sensor data? Can it be aggregated first, with only the aggregated data subject to the retention policy? Can the retention policy be designed to lose fidelity gradually (the way RRDtool is used by Nagios, for example)?
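A minimal sketch of that gradual-fidelity idea, RRDtool-style (the tier choices and names here are hypothetical, purely illustrative): keep raw samples only for a recent window, keep coarser averages for older data, and expire anything past the final tier.

```python
from statistics import mean

# Hypothetical tiers: (max age in days, bucket width in seconds).
TIERS = [
    (7, 1),        # raw 1-second samples for a week
    (365, 60),     # 1-minute averages for a year
    (7300, 3600),  # hourly averages out to ~20 years (the audit window)
]

DAY = 86400

def downsample(samples, now):
    """samples: iterable of (unix_ts, value). Returns sorted
    (bucket_start, average) pairs at the resolution each sample's
    age allows; samples older than the last tier are dropped."""
    buckets = {}
    for ts, value in samples:
        age_days = (now - ts) / DAY
        width = next((w for max_days, w in TIERS if age_days <= max_days), None)
        if width is None:
            continue  # older than ~20 years: expired
        buckets.setdefault(ts - ts % width, []).append(value)
    return sorted((start, mean(vals)) for start, vals in buckets.items())
```

The storage win is large: at hourly resolution, 20 years of one sensor is about 175k points instead of roughly 630 million raw 1-second samples.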

I really don't understand comments like this.

Yes, your company's data may fit in RAM. But does every intermediate data set also fit in RAM? Because I've also worked at a bank, and we had thousands of complex ETLs, often needing tens to hundreds of intermediate sets along the way. There is no AWS server that can keep all of that in flight at one time.

And what about your Data Analysts/Scientists? Can all of their random data sets reside in RAM on the same server too?

Buy them a machine each.

$100K has always been "cheap" for a "business computer" and today you can get more computer for that money than ever.

$100K of hardware (per year or so) is small-fry compared to almost every other R&D industry out there. Just compare with the cost of debuggers, oscilloscopes and EMC labs for electronic engineers.

My company has over 400 Data Scientists and 1000s of Data Analysts.

Buy them a machine each at a cost of $40-60 billion?

Or would it make more sense to buy one Spark cluster and share the resources at a fraction of the cost?

I don’t get your numbers: one machine each for 400 people is $40M; for thousands it might be $100-999M.

Still expensive, but much less than 40 billion.
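For the record, the arithmetic (the analyst headcount is a guess, since only "1000s" was given upthread):

```python
per_machine = 100_000  # the $100K workstation proposed upthread
scientists = 400
analysts = 5_000       # assumption standing in for "1000s of Data Analysts"

print(f"scientists only: ${scientists * per_machine / 1e6:.0f}M")            # $40M
print(f"everyone:        ${(scientists + analysts) * per_machine / 1e6:.0f}M")  # $540M
```

Getting anywhere near $40-60 billion at $100K a seat would take 400,000-600,000 people.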

He said $100k for each user but it's a dumb idea anyway.

We have a Spark cluster which supports all of those users for $10-$20k a month.

Never said EVERY data set fits in RAM, but that doesn't mean MOST of them don't.

There is a trend, when an application is inefficient, to spend huge amounts of resources on scaling it instead of making the application more efficient.

There are a number of cloud database solutions that are very easy to manage and not all that expensive. For example, I work for Snowflake, and our product doesn't need a small army of people to babysit it.

I mean, I also prefer doing things on a single machine, but if that machine gets expensive enough, or writing a program that can actually use all that power gets too difficult, why not switch to a cloud database?

This is about saving you money, so that's not the right fine print.

Look at it this way: this is for the person who's already going to buy enough RAM sticks to fit the entire data set (or multiples of it) across many machines, and then deal with the enormous overhead of running queries across many machines. The revelation is that you can fit that much RAM inside a single machine, for a much cheaper and faster experience.

It will cost you $100K CapEx and will save you many hundreds of thousands in OpEx, plus an untold amount of money in development cost.

Meh, that’s like half the budget of the databases in our dev environment.

We have way too many fucking copies of our DB.

I prefer using mocked data in dev. Smaller dataset, no possibility of PII leaks.

I would love to know how you came up with these numbers.
