
Even more importantly, does your data have to fit in RAM?

There are tons of problems that need to process large data, but touch each item just once (or a few times). You can go a really long way by storing the data on disk (or in some cloud storage like S3) and writing a script to scan through it.
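To make the "scan through it" idea concrete, here is a minimal sketch in Python. The function names and the ERROR/INFO log format are made up for illustration. The point is only that iterating over a file object reads one line at a time, so memory use is bounded by the longest line rather than by the file size.

```python
import os
import tempfile


def count_matching_lines(path, needle):
    """Scan a text file line by line and count the lines containing `needle`."""
    matches = 0
    with open(path, "r", encoding="utf-8") as f:
        for line in f:  # file objects iterate lazily, one line at a time
            if needle in line:
                matches += 1
    return matches


def demo_scan():
    """Write a small hypothetical log file, scan it, and clean up."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".log", delete=False, encoding="utf-8"
    ) as tmp:
        for i in range(1000):
            level = "ERROR" if i % 10 == 0 else "INFO"
            tmp.write(f"{level} event {i}\n")
        path = tmp.name
    try:
        return count_matching_lines(path, "ERROR")
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_scan())
```

The same loop works unchanged on a file far larger than RAM, since nothing here keeps more than the current line and a counter.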

I know, pretty obvious, but somehow escapes many devs.

There's also the "not all memory is RAM" trick: plan ahead with enough swap to fit all the data you intend to process, and just pretend that you have enough RAM. Let the virtual memory subsystem worry about whether or not it fits in RAM. Whether this works well or horribly depends on your data layout and access patterns.

Don't even need to do that. Just mmap it and the virtual memory system will handle it.
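A rough sketch of what "just mmap it" can look like, using Python's standard `mmap` module. The helper names are hypothetical, and this assumes a non-empty file (mapping an empty file raises an error). The OS pages data in as the code touches it and is free to evict pages under memory pressure, so the file does not have to fit in RAM.

```python
import mmap
import os
import tempfile


def count_newlines(path):
    """Count newline bytes in a file through a read-only memory mapping."""
    count = 0
    with open(path, "rb") as f:
        # Length 0 maps the whole file; pages are loaded on demand by the OS.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
    return count


def demo_mmap():
    """Create a small hypothetical data file, count its lines, and clean up."""
    with tempfile.NamedTemporaryFile("wb", suffix=".dat", delete=False) as tmp:
        for i in range(1000):
            tmp.write(f"record {i}\n".encode("utf-8"))
        path = tmp.name
    try:
        return count_newlines(path)
    finally:
        os.remove(path)


if __name__ == "__main__":
    print(demo_mmap())
```

As noted upthread for swap, how well this performs depends on the access pattern: a sequential pass like this one is friendly to the page cache, while random access over a mapping much larger than RAM can thrash.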

Interesting. Can you provide some examples of where this is the correct approach?

This is how MongoDB originally managed all its data. It used memory-mapped files to store the data and let the underlying OS memory management facilities do what they were designed to do. This saved the MongoDB devs a ton of complexity in building their own custom cache and let them get to market much faster. The downside is that since virtual memory is shared between processes, other competing processes could potentially mess with your working set (pushing warm data out, etc). The other downside is that since you're turning over the management of that “memory” to the OS, you lose fine-grained control that can be used to optimize for your specific use case.

Except nowadays with Docker / Kubernetes you can safely assume the DB engine will be the only tenant of a given VM, pod, or whatever, so I think it’s better to let the OS do memory management than to fight it.

Might not be exactly the same use case, but a simple example is compiling large libraries on constrained/embedded platforms. Building OpenCV on a Pi certainly used to require adding a gig of swap.

With the Varnish HTTP cache the authors started out with a very "mmap or bust" type of approach, but later added a malloc-based backend.

Escapes many devs? Really? I used to work with biologists who thought they needed to run their scripts on a supercomputer because the first line read their entire file into an array. But if I saw someone who calls themselves a "dev" doing this I'd consider them incompetent.

I once got into an argument with a senior technical interviewer because he wanted a quick solution for an in-memory sort of an unbounded set of log files.

Needless to say I wasn't recommended for the job, and it taught me a valuable lesson: if you don't first give them what they want, you can't give them what they actually need.
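For anyone wondering what the alternative to an in-memory sort looks like, here is a sketch of an external merge sort in Python. This is not what was discussed in that interview, just one common approach. It assumes every line ends with a newline and that the number of chunk files stays under the open-file limit. All names are made up for illustration.

```python
import heapq
import os
import random
import tempfile
from itertools import islice


def _write_sorted_chunk(lines):
    """Sort one chunk in memory and write it to a temporary file."""
    tmp = tempfile.NamedTemporaryFile(
        "w", suffix=".chunk", delete=False, encoding="utf-8"
    )
    with tmp:
        tmp.writelines(sorted(lines))
    return tmp.name


def external_sort(input_path, output_path, chunk_lines=100_000):
    """Sort a line-oriented file while holding at most `chunk_lines` lines in RAM."""
    chunk_paths = []
    with open(input_path, "r", encoding="utf-8") as src:
        while True:
            chunk = list(islice(src, chunk_lines))  # read the next chunk only
            if not chunk:
                break
            chunk_paths.append(_write_sorted_chunk(chunk))

    files = [open(p, "r", encoding="utf-8") for p in chunk_paths]
    try:
        with open(output_path, "w", encoding="utf-8") as out:
            # heapq.merge lazily merges the already sorted chunk files.
            out.writelines(heapq.merge(*files))
    finally:
        for f in files:
            f.close()
        for p in chunk_paths:
            os.remove(p)


def demo_sort():
    """Shuffle 1000 hypothetical log lines, sort them externally, verify the result."""
    expected = [f"{n:06d} request handled\n" for n in range(1000)]
    shuffled = expected[:]
    random.Random(0).shuffle(shuffled)
    with tempfile.TemporaryDirectory() as d:
        src = os.path.join(d, "in.log")
        dst = os.path.join(d, "out.log")
        with open(src, "w", encoding="utf-8") as f:
            f.writelines(shuffled)
        external_sort(src, dst, chunk_lines=100)
        with open(dst, "r", encoding="utf-8") as f:
            return f.read().splitlines(keepends=True) == expected


if __name__ == "__main__":
    print(demo_sort())
```

Peak memory is roughly one chunk during the first phase and one line per chunk file during the merge, so the input size is limited by disk rather than RAM.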

There are plenty of devs who never do any sort of file streaming, say those who started with Game Maker or in another specialized domain.

I've spent a lot of time writing Spark code, and its ability to store data in a column-oriented format in RAM is the only reason why. Disk is goddamned slow.

As soon as you're touching it more than once, sticking it in RAM upon reading makes everything much faster.
