When your data doesn’t fit in memory: the basic techniques (pythonspeed.com)
638 points by itamarst 30 days ago | 259 comments

Unusual to have no occurrence of the word "stream" anywhere in the article... given that's the usual term for algorithms that require constant or near-constant space regardless of the amount of data they process.

A semi-common beginning programmer's exercise is to write a program that numbers the lines in a text file. The naive solution will use O(n) space, while a bit more thought reveals that this can be done in constant (to be really precise, O(log n) where n is the number of lines in the file) space.
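As a sketch (Python purely for illustration; the exercise is language-agnostic), the naive and the streaming versions might look like:

```python
import io

def number_lines_naive(src, dst):
    lines = src.readlines()              # whole file in memory: O(n) space
    for i, line in enumerate(lines, 1):
        dst.write(f"{i} {line}")

def number_lines_streaming(src, dst):
    for i, line in enumerate(src, 1):    # one line at a time, plus a counter
        dst.write(f"{i} {line}")

dst = io.StringIO()
number_lines_streaming(io.StringIO("alpha\nbeta\n"), dst)
print(dst.getvalue(), end="")  # 1 alpha / 2 beta, each on its own line
```

The streaming version never holds more than the current line, which is the constant-space (strictly, O(log n) for the counter) observation made above.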

Chunking is the term I used because that's more relevant to the data science domain I'm focusing on here (e.g. Pandas has "chunksize", Zarr has "chunks"). Streaming has some implication of an ongoing stream of data to me... but I ought to clarify some of the assumptions about a fixed size of data, yes.

Chunking and streaming are different things to me. Chunking means you get to process multiple rows of data at the same time, usually useful to take advantage of SIMD. Streaming means that the data is accessed in a single pass: once you compute the effect of a given row on your statistics, you never have to rewind to see it again.

Many modern performant solutions will use both, but they're not the same thing.
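A stdlib-only sketch of using both at once, assuming a toy file with one number per line: data is pulled in fixed-size chunks of lines (chunking), and each chunk updates a running statistic exactly once, with no rewinding (streaming):

```python
import io

def mean_of_numbers(f, chunk_lines=1000):
    total, count = 0.0, 0
    while True:
        # Chunking: pull a batch of lines at a time.
        chunk = [f.readline() for _ in range(chunk_lines)]
        chunk = [line for line in chunk if line]
        if not chunk:
            break
        # Streaming: each line touches the running sums exactly once.
        for line in chunk:
            total += float(line)
            count += 1
    return total / count

print(mean_of_numbers(io.StringIO("1\n2\n3\n4\n")))  # 2.5
```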

One important application of chunking is efficient I/O.

Mass storage is most efficient when doing large sequential reads and writes, so you normally feed your constant-space streaming algorithms from buffers with a large number of input records.

Sometimes you can just tell the OS to do efficient chunked prefetching for you.

If you're streaming in a language like Python, its IO will be doing some degree of chunking behind the scenes. It might be beneficial to do more manually.
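A sketch of "doing more manually": instead of iterating line by line, scan the file in large binary chunks (the 1 MiB chunk size here is an arbitrary choice for illustration, not a recommendation):

```python
def count_newlines(path, chunk_size=1 << 20):
    count = 0
    with open(path, "rb") as f:          # binary mode: raw bytes, no decoding
        while True:
            chunk = f.read(chunk_size)   # one large sequential read
            if not chunk:
                break
            count += chunk.count(b"\n")  # scan the buffered chunk in memory
    return count
```

Memory use stays bounded by `chunk_size` regardless of file size.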

Often called buffering when applied to IO.

netCDF, used for storing e.g. large multidimensional climate datasets, is also "chunked" through its HDF[0] backbone.

The hard part for me is the "transpose" or "striding" problem. i.e. when the data is stored in a series of (x, y, z) files for a given hour, day, or month and I need a time series at a point.

[0] https://en.wikipedia.org/wiki/Hierarchical_Data_Format

> Chunking is the term I used because that's more relevant to the data science

If your intended audience are data scientists then why didn’t you mention Dask?

Ditto. Python multiprocessing also uses “chunks” with a specified chunksize.
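The batching that multiprocessing does with chunksize can be sketched with the stdlib; this illustrates the idea, not Pool's actual implementation:

```python
from itertools import islice

def chunked(iterable, chunksize):
    # Yield successive lists of up to `chunksize` items.
    it = iter(iterable)
    while True:
        chunk = list(islice(it, chunksize))
        if not chunk:
            return
        yield chunk

print(list(chunked(range(7), 3)))  # [[0, 1, 2], [3, 4, 5], [6]]
```

`Pool.map(func, iterable, chunksize=...)` ships work to workers in batches like these, amortizing the inter-process communication cost per item.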

Sorry, isn't there quite a bit of a difference between O(1) and O(log N) space?

My first inclination would be to have a counter at zero, read in a line, write out the counter then write out the line, then increment the counter.

The space complexity here is O(1), is it not?

> isn't there quite a bit of a difference between O(1) and O(log N) space?

On a 64-bit CPU, y=x+1 is O(1) time when Y < 2^64, but becomes O(log N) time for larger numbers. With space it's even more complex: sometimes, to save space, it makes sense to keep fewer than 64 bits of these integers, i.e. it may become O(log N) for much smaller values.

For this reason, I'm not a big fan of big O. Useful at job interviews. Probably useful for people working on extremely generic stuff like the C++ STL, which needs to scale both ways by many orders of magnitude: it's not uncommon to have a vector with 1 element or 100M elements, and the standard library must be good at both ends. But I don't often solve such generic problems; I usually have some idea how large my N-s are, and can act accordingly.

Let's say that our file consists of nothing but the new line character (1 byte). At 2^64 lines, that's 18.45 EB.

Sure, for files larger than 18.45 EB at the minimum, computing the new line count will take O(log N) time.

I'm all for nitpicking the issues with Big O notation, but this instance doesn't strike me as one.
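The 18.45 EB figure above is easy to check: 2^64 one-byte lines, expressed in exabytes (10^18 bytes):

```python
lines = 2 ** 64                 # one newline byte per line
exabytes = lines / 10 ** 18
print(round(exabytes, 2))       # 18.45
```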

OP said “to be really precise, O(log n)”, to avoid nitpickers that might have added corrections. And it is absolutely true: to store the number N, you need O(log N) bits. That is how Big O notation works.

The lesson to take away here is that O(log N) is REALLY REALLY small, for all sensible values of N. If an O(log N) algorithm is slow, it’s not because of the O(log N) part, it’s because of the hidden factors in O notation.

> That is how Big O notation works.

No it is not. "Big O" is a way of expressing asymptotic growth of some quantity, most often running time. But you need a machine model to even express running time. And sure, you can define your machine model to require O(log n) time for adding two numbers, but that's not particularly useful. In particular, the word RAM model—which is what is normally used—assumes a word size of θ(log n), and that basic operations on those words take a constant amount of time.

We use models because otherwise even the simplest things would be impossible to analyse. Or do you want to include a three-level cache hierarchy, micro-op decoding, the paging system, branch mispredictions and memory timings in your complexity analysis? Why is it always the "bUt AdDinG tWo nuMbERs tAkeS O(log n) tImE" hill that HN commenters are willing to die on?

To be fair he did say O(1), and qualified the O(log N) with both parentheses and “to be really precise”. If you can’t be precise about the definition of O on HN, then where? (which, “to be really precise”, is a polynomial time tape based Turing machine, not a modern RAM machine with log N words. It’s pointless, but fun to remember :) )

No no no! Big O has nothing to do with Turing machines at all! You can, of course, express the number of steps a single-tape deterministic Turing machine needs to solve some problem using Big O notation. But you can just as well express the number of Qubits that a quantum computer needs to solve some other problem in Big O notation. It has nothing to do with the quantity being measured. Are you perhaps mixing up the definition of complexity classes such as P and NP with Big O notation?

First you need to settle on a model. Then you can express time, space, or some other quantity in that model using Big O notation.

Arguably, I've found that HN is one of the worst places to be precise about Big O notation. Everyone here seems to think they know what it is, only to lay bare their misconceptions about it a second later.

You're simply wrong here. The algorithm is O(log n), whether you agree to it or not.

First of all, there's absolutely nothing that says that we have to assume that the number of lines is less than 2^64. Big O notation does not care what happens below that number, it only cares what happens when you go to infinity, which is a lot larger than 2^64. At that point you need arbitrarily sized integers, which are not fixed-space. Hence O(log n). Word size or RAM model does not matter one whit.

This is how Big O notation works, it has nothing to do with practicality. As another example of a similar thing, consider sorting networks. The AKS sorting network has the best "asymptotic depth" of any known sorting network (O(n log n) depth, I think), but the hidden constants are so massive that there are no situations where the AKS network is actually practical: for any reasonable input it is always beaten by other (asymptotically worse) networks. That doesn't mean it "isn't O(n log n)" or whatever.

Second: if we were to actually write this algorithm in the way which is most conservative with space, we wouldn't use a 64-bit integer in the first place. If we only care about memory, we'd start the count with a single-byte datatype (i.e. unsigned char) to keep the count. The algorithm now uses 1 byte of space. When that overflows (i.e. reaches 256 lines), we would change to a bigger data type, i.e. unsigned short. The algorithm now takes 2 bytes of data.

When THAT overflows, we again promote the datatype to an unsigned int. Now we use 4 bytes of data. When that overflows, we finally promote it to a uint64_t. Now we're on 8 bytes of data. When that overflows... (etc.)

See how the memory usage is growing? And how it's growing exactly with the logarithm of the number of lines?

You're just wrong about this. Yes, you can absolutely say "assuming storage of all numbers is O(1), then the algorithm is O(1)" (essentially what OP said) but that is not a default assumption and it's not accurate to reality (since actually storing numbers requires O(log n) of space). And yes, branch prediction misses and cache misses are far more important factors here, but that doesn't change the fact that it uses O(log n) space. That's just simply true.
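Python's arbitrary-precision integers make the promoted-datatype scheme above concrete: their storage really does grow with the logarithm of the value.

```python
import sys

print((255).bit_length())      # 8 bits suffice for 255
print((2 ** 64).bit_length())  # 65 bits once a 64-bit word would overflow
# The in-memory size of the int object grows accordingly:
print(sys.getsizeof(2 ** 8) < sys.getsizeof(2 ** 200))  # True
```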

Also: we're not dying on a hill. OP made an entirely accurate and true statement, which was met with an avalanche of people criticizing him using incorrect arguments and misunderstandings of Big O notation. We're just saying that he was correct all along.

By your logic, memory access takes time Ω(n^(1/3)) because memory cells need to occupy physical space and we live in a three-dimensional reality, so the time to access a memory cell must necessarily increase with the physical distance to that memory cell. But using a model that made such assumptions would be needlessly complex, so nobody does it. (Incidentally, there was a post on HN that argued this a few years ago, and the comments were full of people who felt the need to ignore the reasoning behind the assumptions backing commonly-used machine models, much like it is happening in this thread.)

Arbitrarily large numbers, represented in binary, certainly take time that is linear in the length of their representation to add. Nobody's arguing with that. But using a model that assumes bitwise operations would be needlessly complex, so outside of some really specialized contexts, nobody does it.

> You're just wrong about this. Yes, you can absolutely say "assuming storage of all numbers is O(1), then the algorithm is O(1)" (essentially what OP said) but that is not a default assumption

I don't know the basis on which you argue that the word RAM model isn't the default assumption, but I feel confident in claiming that outside of some niche communities, it most certainly is.

> and it's not accurate to reality (since actually storing numbers requires O(log n) of space).

That's why it's called a model. We use models because reality is too messy to work with, and we usually don't care about the exact details. Just ask a physicist, they make simplifying assumptions all the time because without them, they'd be getting absolutely nowhere. It's pretty much the same in algorithmics.

> By your logic, memory access takes time Ω(n^(1/3)) because memory cells need to occupy physical space and we live in a three-dimensional reality, so the time to access a memory cell must necessarily increase with the physical distance to that memory cell.

The issue was about space complexity, not time complexity. Very large numbers take more space than smaller numbers.

No, the argument was that because a binary representation of the number n needs ⌈log n⌉ bits, incrementing it would require O(log n) time. And my argument was that accessing these bits in memory takes an amount of time that grows with the cube root of your memory size because of physical constraints. But of course I'm not arguing that we should incorporate that into our models. Doing so would not solve any problems, quite the opposite. But assuming that basic operations on machine words can be performed in constant time solves actual problems such as "why is algorithm analysis in this model so bothersome", "why are there logarithms everywhere", and "why do the theory people say this should get slower for larger numbers? It always takes 1 cycle on my machine".

Do you have any resource mentioning that the algorithm runs in O(log n) time or space? This is really new to me, sorry. I don't mean to argue, it's just the first time I hear such a thing.

It takes O(log n) bits to represent the value n, here the number of lines counted so far.

Does it make sense when expressed compactly like that?

For God's sake, for a second I thought I had forgotten everything about the O notation. Thanks for confirming what I thought is the correct definition of big O.

> On a 64-bit CPU, y=x+1 is O(1) time when Y < 2^64, but becomes O(log N) time for larger numbers.

No. Incrementing a 128-bit counter is also O(1) time on amd64/ARM8/... Only when your numbers are arbitrarily large would that become true.

But that's why we use models that make clear what our assumptions are. In particular, the word RAM model (which is normally used to talk about sequential algorithms) assumes a word size of θ(log n), and that some basic operations on words (such as arithmetic) are possible in constant time.

Models obviously have flaws (otherwise they wouldn't be models), but practitioners often make the mistake of overthinking them. How big, exactly, is your file that a constant number of 64-bit integers doesn't suffice? You can easily increment a 512-bit counter in constant time on an 8-bit Arduino. It doesn't matter whether I have to check one hypothetical 512-bit counter, 8 64-bit counters or 64 8-bit counters, it's a constant number and therefore O(1).
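A sketch of that constant-time multi-word increment (Python used here for illustration; on real hardware each word operation is a single instruction):

```python
WORDS, WORD_BITS = 8, 64          # a 512-bit counter as 8 64-bit words
MASK = (1 << WORD_BITS) - 1

def increment(words):
    # Carry propagates through at most WORDS words: a constant, so O(1).
    for i in range(WORDS):
        words[i] = (words[i] + 1) & MASK
        if words[i] != 0:         # no carry out of this word: done
            break
    return words

counter = [MASK, MASK, 0, 0, 0, 0, 0, 0]   # little-endian words: 2**128 - 1
print(increment(counter))  # [0, 0, 1, 0, 0, 0, 0, 0], i.e. 2**128
```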

Whenever this "Big O is useful for job interviews but not much else" sentiment comes up on HN, it's usually driven by a misunderstanding of what Big O and asymptotic analysis actually mean.

> Only when your numbers are arbitrarily large would that become true.

Technically you’re right, I should have put “arbitrarily large” there, but assumed it’s obvious from the context.

> but practitioners often make the mistake of overthinking them

As an experienced practitioner, I happen to know how extremely inaccurate these models are.

When I was inexperienced, the models helped more than they do now, after I have learned about NUMA, cache hierarchy, micro-op fusion, the MMU, prefetchers, DMA IO, interrupts, and more of these hairy implementation details. Before I knew all these things, heuristics based on asymptotic analysis were generally helpful. Now they're less so; I sometimes deliberately pick an asymptotically slower algorithm or data structure because it's faster in reality.

Well yeah, for many problems, there are solutions that have wonderful running time in the RAM model, like van Emde Boas trees or Fibonacci heaps, but are approximately 100% useless in practice because of the constants involved.

I guess this is where I pitch what we call "Algorithm Engineering", which considers both theoretical guarantees and practical performance, including properties of real-world hardware such as the ones you mentioned :) https://en.wikipedia.org/wiki/Algorithm_engineering

> including properties of real-world hardware such as the ones you mentioned :) https://en.wikipedia.org/wiki/Algorithm_engineering

Are you sure about that? The most recent source of that article is from 2000. Half of the stuff I have mentioned didn’t exist back then.

As far as I'm aware, there are very few people in academia who study modern hardware while focused on software performance. I only know about Agner Fog, at the Technical University of Denmark.

It's a meta-article, it doesn't contain all the results of the field. But the field is alive and well. Just look at the proceedings of SEA and ALENEX (there was no SEA this year for stupid reasons that have nothing to do with the field).

Agner Fog does some cool stuff (and I've used his RNG libraries many times) but as you said he studies modern hardware with a focus on software performance. Algorithm engineering is about finding algorithms that have nice theoretical guarantees (e.g., proving correctness or that there aren't any inputs which would result in substantially worse running time) while keeping in mind that the result would be quite useless if it couldn't be implemented efficiently on the machines we have, so the goal is to come up with algorithms that are also really fast in practice.

> On a 64-bit CPU, y=x+1 is O(1) time when Y < 2^64, but becomes O(log N) time for larger numbers.

I thought that big-O does not actually consider the underlying hardware. How can you specify the CPU model / instruction conditions and still use the semantics of big-O analysis? The mathematical definition of O(_) does not involve hardware; it implies that there exist some coefficients and a constant. I get the point you are making, but is big-O an appropriate way of measuring in this context?

As far as I know, big O notation definitely requires that we define the model of computation we're using and the operations that this model can perform in constant time. Usually that definition is not made explicit or rigorous.

When we say that comparison sort is O(n * lg n), we're assuming a computational model where comparing two integers is a constant-time operation. That's true in most physical computers when we use the CPU's comparison instructions. It's also true in any theoretical model where we define a (finite) largest possible integer. So it works well enough.

You are using O(n) to measure the actual performance of an algorithm in practice. This is not the intention of O(n) -- rather it is used to classify functions into groups depending on "how they grow" with respect to the input as it goes to infinity. This is good because different instructions, or sequences of instructions, will perform differently across CPU architectures, for example two CPUs might have different cycles-per-instruction. Big-O makes claims independent of the hardware that remain true forever. Just because algorithm A is O(n) and B is O(n^2) does not mean that A will always outperform B.

I’m not conflating big O notation with benchmarking. My point applies to theoretical models of computation.

I’m pointing out that, for example, comparison sort is only O(n * lg n) in theory if the theoretical model of computation we’re dealing with can compare any two items in constant time. Comparing arbitrarily large integers can not be done in constant time in theory, so if that’s the problem we’re discussing then the complexity will not be O(n * lg n). But again, we generally are implicitly concerned with models of computation where comparison is a constant time operation.

No. Stop. That's not how this works.

Yes, Big O notation measures asymptotic growth of functions. But what do those functions express? For them to express running times, you need to define a machine model. It's a model because it's not a real machine – we don't stroll into a computer shop and pick up some computer and use that as our reference. Instead, we define an abstract machine model. Often, that's the word RAM model which assumes that for an input size n, the machine can process words of size θ(log n) in constant time. This is reasonable close to how our actual computers work: 64-bit words, and we can do stuff with them in constant time. But it's not perfect, because the RAM model doesn't have any caches (let alone multiple levels of cache hierarchy), doesn't have disks, doesn't model the paging system, etc. If any of these things are of significant concern to you, pick a different model (e.g. an external-memory model to consider memory hierarchy, or look into cache-oblivious algorithms, ...).

But talking about asymptotic running time without specifying which model you're using is absolutely pointless.

What you are missing is that "algorithm A is O(n)" is a meaningless statement unless you define which operations you're counting.

Typically, the n refers to the length of the input. Bigger numbers to compare get cancelled out by the larger n.

I don’t think that’s accurate. The length of a collection is not necessarily correlated to the largeness of its largest element. The n in comparison sort is indeed the length of the collection, and that’s the only variable we generally discuss because we generally assume some fixed maximum allowable element.

I am referring to the bit length of the input.

> is big-O an appropriate way of measuring in this context?

I don't have formal CS education. I'm less interested in math and more oriented towards practical aspects. Infinity doesn't exist in practice (except these IEEE float magic values), but we can reason about shape of functions, e.g. "count of executed machine instructions as a function of N".

Not the same thing as CS-defined big O, but close.

Can be useful to reason about complexity of algorithms. More precisely, about complexity of their implementations.

Usually useless to reason about time: a machine instruction takes variable time to run on a modern CPU, the same instruction can take less than 1 cycle or much more than 1M cycles.

The amount of memory needed to hold an integer is log(N). That's just because the log is about how many digits it has. That fact doesn't depend on the hardware, it's just a math fact.

There's generally an assumption that the elementary values we're using fit into native hardware values, and that some elementary operations on those values can be done in constant time (e.g. a CPU instruction).

If you want to talk about something like supporting arbitrarily large integers or defining the memory of the physical computer we're using, we need to be explicit about that, because it changes a lot of things. After all, a finite-memory computer is not Turing complete, and everything it can compute can be computed in constant time. :)

it's a little more complicated than that.

how so?

well I can hold somewhat arbitrarily large integers in memory with the Ackermann function, for instance.

If I've read between the lines correctly, your point is that N bits allows you to represent 2^N values, but they don't have to be the values from 0 to 2^N.

yes, sorry for being terse.

everything that terminates is O(1) when you upper-bound the input size

That's still pretty naive. You can get away with one char and one int.

- Read byte from input

- Write byte to output

- When byte is "\n": write the counter, write a space, increase the counter by one

Most likely, whatever logic you use to turn the counter into a string will already use more memory than your actual loop.
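Those three steps, as a byte-at-a-time sketch (it follows the description literally, so it also emits a number after the final newline):

```python
import io

def number_lines(inp, out):
    counter = 1
    out.write(b"%d " % counter)       # number for the first line
    counter += 1
    while True:
        b = inp.read(1)               # read byte from input
        if not b:
            break
        out.write(b)                  # write byte to output
        if b == b"\n":                # newline: write counter, then a space
            out.write(b"%d " % counter)
            counter += 1

src, dst = io.BytesIO(b"foo\nbar\n"), io.BytesIO()
number_lines(src, dst)
print(dst.getvalue())  # b'1 foo\n2 bar\n3 '
```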

Technically it's still not O(1), as you will need one more bit for your counter variable every time you double the number of lines; realistically, you will most likely be using an integer anyway, so you can consider it "constant".

Storing the number n takes log n bits, if we're being super pedantic :)

In practice, I'd say it takes exactly 64 bits for any file.

Asymptotic notation isn't about what's "in practice."

We have algorithms with very good asymptotic complexity for multiplication, but we don't use the ones that scale closest to linear because the constant factor is enormous.

There are different models of computation. You can have a model in which integers can be arbitrarily large yet still count as constant memory (for analyzing practical-size algorithms on 64-bit machines, for example); I'm not sure how common those are. I know for example arithmetic is often assumed to be constant time (it indeed varies by input size, even x=y+1, which is clearly O(log(y))).

Also relevant here is the RAM machine model that is often employed, where you can have essentially unlimited memory that can be randomly accessed at fixed time. This again is roughly in line with real practical machines running algorithms that use, and fit within RAM. In reality if you wanted a more physically consistent model eventually your access times must differentiate and increase for increasing memory -- data occupies physical space and must be fetched at increasingly long distances, limited by the speed of light. In principle this means m bits of memory at best can be accessed in O(m^(1/3)) time [1]. Of course, this realism isn't always practically relevant (maybe for datacenter-scale problems, perhaps even larger) and complicates the analysis of algorithms.

[1]: Just for fun, if you want to get really physically accurate, this isn't quite right either I suspect -- that's because with enough data (physical bits) occupying a volume at constant density it will eventually collapse into a black hole, which kinda destroys your computer :). So for planetary-scale computers you eventually need to spread your data across a disk, like a small galaxy (or Discworld, if you prefer :)), so it won't collapse, giving it O(sqrt(m)) access time. Surprise, very large worlds must be flat.

This is related to the BH entropy formula, and the so-called Holographic principle, I guess -- the entropy and hence information content of a volume is surprisingly bounded by its area, since a black hole's area is proportional to its mass. The weird thing of course is that the universe itself perhaps should collapse to a black hole, since it has a lot of stuff in any sufficiently large volume; being young and dynamic, it hasn't happened yet. It's also given by its fractal scale, but I guess this digression has grown large enough already :)

(do tell if you want to learn more)

If my integers can be arbitrarily large and count as constant memory, why can't I just reduce everything to a 2-counter machine and solve every problem in constant space (and a lot of time)?

Because the Big O notation as used by engineers is actually about practical problems and not asymptotical behavior.

So integers count as constant memory AND can be considered to be arbitrarily large for the purpose of the current problem. Because having more than 2^64 lines in a file will never happen in practice, and even if it does then 2 integers would do the job for the rest of the observable universe.

Be pragmatic, you are on Hacker News.

Uh, that's what the theory people do, too. We make assumptions about the model, for example, that if our input is of size n, a machine word should be able to store that number (word size θ(log n) bits). Otherwise we'd never get anywhere with asymptotic analysis.

In my experience, practitioners just have a lot of misconceptions about "Big O".

You can, it just makes for a useless result.

With the O() notation the units are some arbitrary computation step or arbitrary unit of memory, which could even mean matrix multiplication of two 1024x1024 matrices of doubles. The whole point of big-O notation is that you don’t care about how fast/expensive it is, but how it scales.

Yes and the memory to store a number scales with the log of the number

> Unusual to have no occurrence of the word "stream" anywhere in the article...

Or "database"

Both indicators (along with Python) that this is an article written by yet another data scientist who thinks engineering is unnecessary. Data science is a field plagued by individuals with very specific, narrow experiences that happen to be easy to engineer, which makes them think they are experts in data engineering. Vertical scalability has a hard bound; if your algorithm changes complexity, it could become literally impossible to move forward. Horizontal scalability gives you options. That's why we like it: not just because we like to overengineer systems, but because we don't like embracing hard limitations on the systems we engineer, especially when we often don't yet know what problem we are solving. Engineers often don't get the liberty of solving extremely specific instances of problems like data scientists do.

> Data science is a field that is plagued by individuals having very specific narrow experiences that are often easy to engineer making them think they are experts in data engineering.

That's on point; some of the worst and most inefficient code I've had to maintain was written by data scientists. They are for sure leagues above any of my data science knowledge in their respective fields, but their average engineering knowledge and grasp of best practices seems pretty low.

Most of the time it's not that data scientists think engineering is "unnecessary," it's that they're expected to deliver analyses and models without the help of engineers, because the engineers are on another team, behind a massive backlog. So they make do with the tools they have.

This is totally fair, and it's why we need to bring the two disciplines closer together rather than have one try to obsolete the other.

Um, yes. From the article:

"The simplest and most common way to implement indexing is by naming files in a directory:"


This is time partitioning, not indexing, but it's commonly done in data lakes for raw telemetry/log data.


Where each one of those CSV files could be 10-100 GB in size.

Usually you want to process it into a columnar format like Parquet from there, though

>A semi-common beginning programmer's exercise is to write a program that numbers the lines in a text file. The naive solution will use O(n) space, while a bit more thought reveals that this can be done in constant (to be really precise, O(log n) where n is the number of lines in the file) space.

I guess I'm misinterpreting the task, because I'm having trouble seeing how it's possible to access each line in a file without it being O(n) at a minimum. I understand O(log n) to mean that not every data point is "accessed" due to the way the data structure is organized, lines of a file in this case, but to me it appears that not every line would be prepended with a number. My instinct is that there's a lot going on with new lines/line breaks in files that I'm not aware of, so I'd really appreciate any reading material on the subject to help me better understand this. Thanks!

I think you've confused it with O(n) time; the claim is about the amount of memory that is needed, not the time required to access a particular line.

Good point, I was thinking in terms of time complexity rather than space, but the title clearly is in reference to memory. I guess that explains my confusion!

I made the same mistake at first, had to reread the post. I'm so used to seeing Big O notation used for time that I guess my eyes just passed over the word "space".

(My mistake obviously, the OP was very clear.)

> A semi-common beginning programmer's exercise is to write a program that numbers the lines in a text file.

In terms of RAM, this can be done with an arbitrarily small buffer, in the sense that the file can be read byte by byte. I.e. O(1).

Edit: For all practical purposes the line counter is also constant space.

Yeah log(n) would be the space to hold the digits of the counter. Presumably (unless things got super-silly), this would be dominated by the buffer size for the line being read.

The point there is that you don't need a buffer for the whole line, you only have to find the line terminators. Something like:

    printf("0: ");
    ln = 1;
    while ((ch = getchar()) != EOF) {
      putchar(ch);
      if (ch == '\n')
        printf("%d: ", ln++);
    }
And I believe the original Unix implementation of nl is mostly this, although with the stdio functions replaced with raw read/write.

Yeah, I think the above poster said O(log n) space because that's the size of the line count in bits.

The usual computation model includes "log n sized words", so it really is O(1). You usually are implicitly using that model; otherwise any algorithm that used pointers would have an extra log n tacked on, e.g. linked-list append would be log n, not constant.

I was also surprised by the lack of streaming. In Python specifically for really large JSON datasets I wrote this streamer that many have found useful - https://github.com/kashifrazzaqui/json-streamer
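For comparison, when the data happens to be in a line-delimited format like JSON Lines (a simpler case than the nested-JSON documents a streaming parser like json-streamer targets), the stdlib alone streams fine:

```python
import io, json

def stream_records(f):
    # One JSON document per line: parse and hand off each record as it
    # streams past, never holding the whole dataset in memory.
    for line in f:
        yield json.loads(line)

data = io.StringIO('{"x": 1}\n{"x": 2}\n')
print([r["x"] for r in stream_records(data)])  # [1, 2]
```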

>The naive solution will use O(n) space, while a bit more thought reveals that this can be done in constant (to be really precise, O(log n) where n is the number of lines in the file) space

Can you elaborate on the O(1) solution?

Not an expert, but after reading around: OP is saying you only need a constant-sized buffer (that could be as little as one character deep) to read the whole file looking for line breaks. This roughly takes O(1) space (a lot of commenters are confusing time vs. space requirements), but there's a little nitpick to be found: in extremely big files the line counter itself would be O(log(n)) and would dominate.

I see, yes that makes sense. The conflation with time instead of space was confusing me. Thanks.

> write a program that numbers the lines in a text file

Ah, simple :)

    cat file | wc -l

Useless use of `cat`. :)

    wc -l file
To hide filename:

    wc -l < file

I'm actually pretty much against UUoC. The cat example is much more resilient to modification. For instance, there are usually natural modifications considering filtering etc.

Anyway, since he said number we probably want `nl` not `wc`.

Separating pipe-setup from processing is sound engineering to me and makes things much easier. For instance as I iterate, I will have cat piped to head piped to filter but eventually I'll take the head out and run the whole thing. That's a trivial ^W versus moving around arguments and worrying about arg order, etc.

That's right regarding directly giving the filename for the program (here wc) to open itself.

However using the input redirect is fine, especially if you format it that way (which is exactly the same thing):

  <file wc -l
Then, adding another step is as natural as with cat:

  <file grep meh |wc -l

Useless use of `wc` :)

    <file grep -c meh

25 years using shells and I never thought to place input redirection at the front. Well, thank you!

I am confused. What's wrong with `wc -l < file | ... | ...`?

I know `< file wc -l | ... | ...` is identical because the redirection is done before the command is executed, but what does putting it in front help with?

Because I feel the other answers, though correct, are entirely missing out on clarity:

The first iteration looks like this:

    <file head -n10 | grep foo | tr baz qux

The second iteration looks like this:

    <file gzcat | head -n10 | grep foo | tr baz qux

It makes the change much easier, since the first thing written is the first thing that happens. In `gzcat < file`, the logical first step - reading and streaming file - is now the physical second step. Like a German sentence, whenever you want to prepend the file, you have to maintain an item in second position.

You can easily change that first command with muscle memory. That's why I use cat $file| but I guess <$file would work too.

As per James Taylor's comment above, using cat means you can easily replace it with head e.g. for testing.

And not change anything else in the pipeline.

I hate when people nitpick this stuff. What if we just straight prefer the first one?

There's a whole Wikipedia section for this:


It all makes a lot more sense now. I'm glad I can add demoggification to my vocabulary.

I usually use cat, because I often build a pipeline by starting with

  head filename | ...
then just change "head" to "cat" when I get it working, or grep if I want to check parts of a file, etc.

Likewise here, it maintains the left-to-right direction and "one operation per | gap |"

Bad code is code smell.

In this particular case, people (like me) would wonder "what's the point" and then go searching for the details of how `cat` and `wc` work, assuming there might be a reason the original developer wrote this code, as opposed to the simpler `wc -l` (i.e. Chesterton's Fence).

People usually use it because piped input is a known interface, you don't have to go digging through a man page or help page to figure out what flag is needed to accept input from a file. "Data flows from left-to-right" is also very intuitive, and you don't need to know the specifics of how your shell interprets a command line to figure out where to put the input redirection in the case of a more complicated shell command.

The point is I personally find it more ergonomic. Just like how I'll take a slightly longer route to avoid turning directly across traffic or avoid a particularly dangerous intersection.

Also, context is really important. Not that people on the internet ever stop to think about what the context is that the person is writing the code in. I don't write bash scripts that run mission critical workloads. I just ad-hoc type stuff into my shell to produce output I need for stuff at the time.

Well, one reason is programs being smart (dumb) about output if they detect that stdout is a console. Using an extra cat eliminates this.

or just:

  cat -n file

Or even:

    nl file

   sed = 1.txt
Uses a small buffer; no need to fit all the data into memory.

O(log n) is so precise it ain't even informative :P

O(log n) is not constant

I think the point in this case is that log n has a reasonable upper bound, and can be considered constant for any real world problem.

> The easiest solution to not having enough RAM is to throw money at the problem.

People underestimate how much memory modern computers can actually support if you max them out.


I once had a job interview where they gave me a problem and expected some answer of "use a Spark cluster, yada yada" and I said -- well, given the problem specs and upper bounds you've given, I'd throw a $200 stick of RAM at it and be done.

They wanted some answer involving Spark and engineers you'd effectively pay $200/hr for many weeks. I didn't get the job.

OK, but seriously speaking, if the upper bound is beyond 786gb RAM or whatever the current max is, I might want to use dask distributed. https://distributed.dask.org/en/latest/

edit: wrote mb instead of gb

It's slightly mystifying. The only company I've worked at that did "big data" _really_ well just plugged a few TB of RAM into some sharded databases and got on with life.

Usually when I tell that story, I get a lot of objections about how that solution won't scale and they must not have really had big data from people who are, truth be told, used to working with data on a fraction of the scale that this company did.

That said, it's not a turnkey solution. This company also was more meticulous about data engineering than others, and that certainly had its own cost.

You can't put Hadoop on your resume if you solved the problem in RAM instead.

I've always disliked the term "big data" because all of the attempts at a definition seemed either stupid or vague. After a while, I came up with this definition: it's a set of technologies used for processing data that is too large to be processed in a single machine.

The thing that gets me about that definition is that "too large to be processed in a single machine" leaves out a lot of variables. How's the machine specced? How's it being analyzed? Using what kinds of software?

If the only single-machine option you consider is Pandas, which doesn't do streaming well and is built on a platform that makes ad-hoc multiprocessing a chore, you'll hit the ceiling a lot faster than if you had done it in Java, which might in turn be hard to push as far as something like C# (largely comparable to Java, but some platform features make it easier to be frugal with memory and mind your cache lines) or, dare I say it, something native like ocaml or C++.

Alternatively, if you start right off with Spark, you won't be able to push even one node as far as if you hadn't, because Spark is designed from the ground up for running on a cluster, and therefore has a tendency to manage memory the same way a 22-year-old professional basketball player handles money. It makes scale-out something of a self-fulfilling prophecy.

Also, as someone who was doing distributed data processing pipelines well before Hadoop and friends came along, I'm not sure I can swallow "big data" being one and the same as "handling data that is too big to run on one computer." Big data sort of implies a certain culture of handling data at that scale, too.

Because of that, I tend to think of "big data" as describing a culture as much as it describes anything practical. It's a set (not the only set) of technologies for processing data on multiple machines. Whether you actually need multiple machines to do the job seems to be less relevant than the marketing team at IBM (to pick an easy punching bag) would have us believe.

Saying big data is data too large to process on a single machine purposefully leaves out the spec of the machine.

That's because a reasonably sized machine from today is much larger than one from five years ago. And an unreasonably large machine today is also larger but yet more achievable.

A basic dual Epyc system can have 128 cores, and 2TB of ram. Someone mentioned 24 TB of ram, which is probably not a two socket system.

You can do a lot with 2TB of ram.

And there are still some use cases beyond the single machine: e.g. CERN.

But I think it's quite safe to say that it's not often because you need to process so much data, but rather that your experiment is a fire hose of data, and you're not sure what you want to keep, and what you can summarize - until after you've looked at the data.

And there might be a reason to keep an archive of the raw data as well.

Another common use case would be seismic data from geological/oil surveys.

But "human generated" data, where you're doing some kind of precise, high value recording, like click streams, card transactions etc might be "dense", but usually quite small compared to such "real world sampling".

And more often than not, it's a set of technologies used for processing data on many machines which could be processed faster on one.

People do not want to hear about how they’re working way, way too hard to feel special.

So far my use of "big data" was on networking management solutions for mobile networks, containing years of telecomunication data growing by the second, Oracle and SQL Server OLAP engines managed any kind of query without much sweat, other that fine tuning queries and indexes.

Bet their job description was like: "flexible", "open-minded people" and "best tool for the job".

But then they don't usually like it when you demonstrate anything but the answer they want to a highly contrived problem.

Yes, technical interviewing is hard.

> 786mb RAM

Did you mean gb? In any case you can actually do at least 24TB these days: https://aws.amazon.com/blogs/aws/ec2-high-memory-update-new-...

That's ludicrous.

Unlike most big data solutions, dask is just so convenient. I sometimes use it just to concatenate a bunch of files into a dataframe. And yet, it scales up to really large datasets, too.

768G is not current max. Recently bought a machine (for less than $3k) for a side project, it has 768G inside and is 4 years old.

Today most commodity servers (which you can buy out of pocket without a lengthy preorder) can accommodate from 1.5 to 3 TiB of memory.

As a nightmare in the opposite direction, I've been writing a new pathfinding system for Second Life, from the user side. Where you get a maximum of 64KB per program. KB. Not MB. And that's for code, stack, and data, using the Mono engine.

You can have many programs, and they can pass messages around, so you can do jobs that won't fit in one program. It's like coding for a cluster of Arduinos.

I've had to pack data into integers with shift operations. Come up with an efficient maze solving algorithm that only needs 2 bits per cell. Divide tasks into multi-step pipelines. Monitor memory usage and divide the data into smaller chunks that will fit. Plus it's soft real time and has to run properly under overload conditions when it can't get enough CPU time.

It's kind of neat to see autonomous characters running around the virtual world at running speed. This was believed to be impossible, but I got it working out of sheer stubbornness. It was far too much work.

(Second Life has a built-in pathfinding system. It's too buggy to use, and they refuse to fix it. This is a workaround.) (Why such tiny programs? Because they are not only persistent, for years if necessary, but are copied from one machine to another, state and all, as objects move around.)

That's unreasonably cool. Do you think it consumes more resources than if they just gave you more than 64K to work with, or do you think you were pushed to optimize further because of the limitations?

It uses about 750KB total, spread over a dozen or so tiny programs. Probably 2x the memory and more CPU time because it's broken into so many pieces, both in time and code. And at least 2x-3x the development time.

It pushed me to work on minimum-memory maze solving. It costs a lot to test a cell (this involves ray-casting in the simulated world) so something like A*, which examines most of the cells, is out. An algorithm in Wikipedia turned out not to work; some anon had snuck in a reference to an obscure paper of their own. Had to fix that. What I'm doing is "head for the goal, when you hit something, follow the wall, if it will get closer, head for the goal again". Wall following is both ways simultaneously, so you don't take too long on the long path when a short path is available. After getting a path, the path is tightened up to take out the jaggies. This is not optimal but is usually reasonably good.

The rest of it is more or less routine, and a pain to break into sections. The programs have to communicate with JSON, over a weak IPC system with bad scheduling.

I always liked the 3D "metaverse" concept. Second Life, which has about 30,000 to 50,000 users on line at any one time, comparable to GTA Online, is the biggest virtual world around. Everybody bigger is sharded, but all SL users are in one world. The technology needs a refresh, but every competitor who's tried to build a big virtual world has been unable to get many users. So, you're stuck with outdated tech if you want to get something used in a virtual world. I think they're still in 32 bit mode on the servers, even.

And not just on super fancy computers, 128GB of RAM is like $500 and is supported by many low-to-mid-range desktop systems.

Without ECC that's a risky way to compute.

Eyeballing the numbers [0][1] I would expect one bitflip between once every year and once every 100 years on 128GB of memory. I'm willing to take those odds.

Of course with some planning you can get an AMD system with ECC support on the mainboard; ECC RAM is about the same price as consumer RAM.

0: https://www.cs.toronto.edu/~bianca/papers/sigmetrics09.pdf

1: https://www.cs.virginia.edu/~gurumurthi/papers/asplos15.pdf

When Craig Silverstein left Google somebody asked him "What was the biggest mistake you made while at Google?" His response was "Not using ECC memory."

The problem is that both the consequences and debugging time of a single random bitflip are potentially unbounded. That time that GMail lost 10% of their accounts and had to restore from tape was because of a single bitflip (due to a software bug, not a cosmic ray). Google Search lost months of engineering time in the early days from cosmic rays, back when they really could've used those engineers on other stuff.

The odds may be low, but they do happen when you have multiple computers, and the consequences are high. Not really odds I'd want to take, when ECC RAM isn't that much more expensive than non-ECC RAM.

Since Google was building their own servers it seems like they might have been able to sidestep the ECC tax too. Normally ECC costs so much more because it's the "serious user" solution so vendors feel free to mark it up.

I kinda wonder if we’ll come out the other side of this DRY era and rediscover redundant systems. Turning really important decisions into a single bit in memory is overoptimization. Especially when all of the less important bits are consuming gigabytes.

How does ECC RAM protect against software bugs?

To be clear, that sentence was in there to illustrate the effect that single-bit errors could cause. That incident happened after Google had already long since switched to ECC and obviously wouldn't have been prevented by it.

However, we have a large number of other processes (testing, type systems, formal verification, code reviews, release processes, etc.) to protect against software bugs. There is no protection against cosmic rays. You don't want to be in a situation where all of the defect-mitigation work that the last 50 years of computer science has accomplished is rendered useless by a random freak occurrence.

(The bug in question was actually in a migration script, and made it into production because people thought that migration scripts were one-off throwaways that didn't need the same amount of testing, code review, verification, and general carefulness that the production code does. Lesson learned. The postmortem for it actually had the lesson of "Treat your migration code as permanent, and apply all the same standards of maintainability and reliability of it that you do to production code.")

It's also why ZFS requires the use of ECC memory in the official documentation - ZFS spent great efforts building redundancy and error-checking capabilities as part of the filesystem, especially for guarding against silent data corruption, even at the expense of performance. But it would be useless and greatly decrease the benefits of these features if the memory can fail silently.

Also, being hit by a cosmic ray is not the only way that bits in RAM can be flipped: dynamic RAM has inherent instabilities like row hammering, or can fail early due to manufacturing defects.

Could you explain how hardware would have prevented a software bug?

> Eyeballing the numbers [0][1] I would expect one bitflip between once every year and once every 100 years on 128GB of memory. I'm willing to take those odds.

Throw in the likelihood of the flipped bit being consequential and the odds look even better. To potentially do major damage, the flipped bit would have to be in an area of memory of something being executed, and it has to flip after being read and before being executed, which probably adds at least 2 more orders of magnitude. Even for general bitrot, data has to be in memory and get flipped between reads and writes. Those odds are vanishingly small compared to programmer error.

IIRC ryzen in general can run with ecc, though motherboard support is a little shaky

The low to midrange Xeons with this kind of memory capacity have ECC.

If you're not buying for your particular specifications, you're doing it wrong anyway... There are plenty of workloads tolerant of memory errors (e.g. just about any kind of simulation).

That's a bit less true with the rise of the cloud, replacing physical machines. Instances are charged in proportion to the amount of memory, making TB of RAM noticeably expensive.

Wow it's less than $7/hour on AWS for 1TB of RAM on demand. $27/hour for 4TB. If you reserve instance (for 3 years) you can even go to 24TB.

Not that it's entirely equivalent (man-hours of maintaining the local hardware/software vs man-hours of AWS), but you could buy ~22.5TB of RAM outright for the price of using the 1TB AWS instance at $7/hour for a year, assuming a current price of $2.75/GB of RAM. If you only need the 1TB (5% of the cost), you could put the difference ($58,000) towards hardware and an engineer at quarter time to maintain the hardware.

Bonus: you get to keep the hardware.

Unfortunately, big corporations think it's a "not scalable" solution and prefer to overpay $60k to outsource to AWS...

For the "big data" stuff under discussion, you're probably spinning up an instance for some training then turning it off again. $7/hour for an hour or two a day, is about $4k/year. Maybe you retrain weekly and it's only $500/year. At that kind of price point, who cares? As the person making these algorithms, I'd hate to have to wait to order new machines for me to run something new, or to schedule time on the cluster or whatever, instead of just throwing an elastic amount of money at it.

I can see the same argument being used for stuff you could run locally up to 500k in infrastructure per month. People are so used to managed servers that they don't even see the possibility of using something else. Now everyone wants to be cloud agnostic (judging by conferences), so they want to have the pie and eat it too. Meanwhile, companies with mixed on-premises/cloud setups pay the least, based on all the consulting I've done in this regard.

Sure, unless it's a laptop.

I often wonder how much of the computational universe in “high tech” is indeed blinded by what we can stick in a laptop.

That, and the idea of a polyphase merge sort isn't taught in ye olde boot camps, I guess.

I love a good merge sort, it's just built for chunking. I've used it to sort data that just barely fit on disk.

> Sure, unless it's a laptop.

That's a self-imposed constraint, not a problem constraint.

Computational resources are dirt cheap nowadays. Everyone can get free time in global-scale clusters. The only reason anyone is stuck with a laptop to run number/data crunching tasks is because they want to be.

I'm missing the most obvious one: check to see if you need all that data. Plenty of times you can get 98% of the quality with a small fraction of the available data. Doing a couple of tests to determine what constitutes a representative sample could save you a ton of time and a ton of money.

I spent a few weeks on-and-off between more important study, trying to help a friend process their sequencing data in a bespoke way. I kept trying to write an algorithm that would take about 500 hours to run. I first used Pandas (which is kinda slow) and then tried base python dict() and iterables. Eventually I realised that most of the data was actually redundant if I just counted how many instances of each unique row were in my dataset and then just threw away the extras while keeping the counts for later. My new algorithm did what I wanted in 8 seconds flat and was only a tiny bit more effort to integrate with the counts.

My lesson learnt was that I just needed to reduce how much data I was working with first, instead of trying to stuff a multi-gb file into memory!
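The counting trick is tiny to write down. A hedged sketch with made-up data (the function name and rows are illustrative, not the actual pipeline):

```python
from collections import Counter

def process_unique(rows, expensive):
    """Run `expensive` once per distinct row; keep counts for later."""
    counts = Counter(rows)                       # one pass over the data
    return {row: (expensive(row), n) for row, n in counts.items()}

# 500 hours of per-row work collapses to one call per unique row
results = process_unique(["AAGT", "AAGT", "CCTA"], expensive=len)
```

If most rows are duplicates, both memory and compute drop by the deduplication factor.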

This is the number one thing to check first.

I am surprised there is no mention of hdf5 (via h5py). It's what I used to deal with handling simulation data that was getting too big for RAM to handle.

Use case: CPU is running a simulation and constantly spitting out data; eventually RAM is not enough to hold it all. Solution: every N simulation steps, write the data in RAM to disk, then continue simulating. N must be chosen judiciously to balance the time cost of writing to disk (you don't want to do it too often).

I figure this is what is referred to as "chunking" in the article? Why not list some packages that can help one chunk?

Overall opinions on this method? Could it be done better?
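For reference, a minimal h5py sketch of that pattern; names and sizes here are invented:

```python
import numpy as np
import h5py

# Every N steps, append the in-RAM buffer to a chunked, resizable
# HDF5 dataset and free it. Only N * WIDTH floats live in RAM at once.
N, STEPS, WIDTH = 100, 1000, 3   # STEPS is a multiple of N here

with h5py.File("sim.h5", "w") as f:
    dset = f.create_dataset("state", shape=(0, WIDTH),
                            maxshape=(None, WIDTH), chunks=True)
    buf = []
    for step in range(STEPS):
        buf.append(np.random.rand(WIDTH))   # one simulation step's output
        if len(buf) == N:                   # flush the buffer to disk
            old = dset.shape[0]
            dset.resize(old + len(buf), axis=0)
            dset[old:] = np.asarray(buf)
            buf.clear()
    # (if STEPS % N != 0, a final flush of the remainder goes here)
```

The `maxshape=(None, ...)` is what makes the dataset appendable; `chunks=True` lets HDF5 pick a chunk shape.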

This is exactly what HDF5 was built for. Figuring out how often to persist to disk is going to depend on a number of factors. If you want to get fancy you can dedicate a thread to I/O but that gets hairy quickly. You also might want to look into fast compression filters like blosc/lzf as a way of spending down some of your surplus CPU budget.

That depends - we use HDF5 to store data from huge simulations (~1TB/snap) which are then obviously too large to analyse on a single machine.

The benefit of HDF5 is that it allows for very easy slicing of data. So 'chunking' where I load e.g. 10% of one dataset at once is very simple with HDF5.

Also look at whether you can solve your problem with a different algorithm, one that is more friendly to external storage.

One classic example is sorting. When your data is too big for RAM, quicksort's performance is horrible and mergesort's performance is fine (even if your data is on magnetic tape).

Another classic example is taking a situation where you build a hash (or dictionary) and using sorted lists instead.

Let's say your task is to take some text and put <b></b> tags around a word but only the first time it occurs. The obvious solution is to scan through the text, building a hash as you encounter words so you can check if you've seen it before. Great until your hash doesn't fit in RAM.

The sort-based solution is to scan through the input file, break it into words, and output { word, byte_offset } tuples into a file. Then sort that file by word using a stable sort (or both fields as sort key). Now that all occurrences of each word are grouped together, make a pass through the sorted data and flag the first occurrence of each word, generating { byte_offset, word, is_first_occurrence } tuples. Then sort that by byte_offset. Finally you can make another pass through your input text and basically merge it with your sorted temp file, and check the is_first_occurrence flags. All of this uses O(1) RAM.

I believe this is basically what databases do with merge joins, but the point is you can apply this general type of thinking to your own programs as well.
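An in-memory sketch of the three passes described above. In the real out-of-core version, each sorted() call would be an external merge sort over temp files, but the logic is identical:

```python
import re

def bold_first_occurrences(text):
    # Pass 1: emit (word, start, end) tuples in document order.
    occ = [(m.group(0).lower(), m.start(), m.end())
           for m in re.finditer(r"\w+", text)]
    # Pass 2: stable-sort by word; the first tuple per word is its
    # first occurrence in the text.
    firsts, seen = [], set()
    for word, start, end in sorted(occ, key=lambda t: t[0]):
        if word not in seen:
            seen.add(word)
            firsts.append((start, end))
    # Pass 3: re-sort by offset and merge with the original text.
    out, pos = [], 0
    for start, end in sorted(firsts):
        out.append(text[pos:start])
        out.append("<b>" + text[start:end] + "</b>")
        pos = end
    out.append(text[pos:])
    return "".join(out)
```

Swap `sorted` and the lists for GNU `sort` and temp files and the RAM use becomes constant.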

I believe this is a problem that CS has already solved: http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf

The Count-Min Sketch is a data structure consisting of a fixed array of counters.

This is a short and easily accessible paper, which isn't very heavy on math or obscure notation or concepts, so I would recommend it to anyone with even a cursory interest in the subject.
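For the curious, a toy version of the structure (parameters invented; real implementations use proper pairwise-independent hash families rather than Python's hash()):

```python
import random

class CountMinSketch:
    """Toy count-min sketch: `depth` hashed rows of `width` counters.
    Estimates can overcount (hash collisions) but never undercount."""

    def __init__(self, width=1000, depth=5, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.salts = [rng.getrandbits(64) for _ in range(depth)]
        self.rows = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        for row, salt in zip(self.rows, self.salts):
            yield row, hash((salt, item)) % self.width

    def add(self, item, count=1):
        for row, i in self._cells(item):
            row[i] += count

    def estimate(self, item):
        # The minimum over rows is the least-contaminated counter.
        return min(row[i] for row, i in self._cells(item))

cms = CountMinSketch()
for word in ["the", "the", "cat"]:
    cms.add(word)
est = cms.estimate("the")    # exactly 2 with high probability
```

Memory is fixed at width * depth counters no matter how many items stream through.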

Here's an implementation in Python: https://github.com/barrust/count-min-sketch/blob/master/pyth...

Another easy solution I didn't see mentioned is to create swap space on Linux. This obviously isn't the fastest solution, but setting up 128GB of swap space lets me mindlessly load most datasets into memory without a single code change.

Adding a 280GB Optane as swap is very efficient and cheap. It is still a lot slower than RAM though. But much much faster than NVME ssd's

Are you talking about Optane NVDIMM or NVME?

Do you have a sense of how the performance would compare to chunking? My naive expectation is that loading data from disk to swapped memory involves writing that data to disk (even if it will only be read).

Chunking can speed up processing even if the dataset fits into memory because you are interleaving disk reads and computation and the OS is likely to prefetch the next chunk into the read cache while you're still busy computing.

On other problems chunking doesn't work at all and just mmaping or dedicating giant amounts of swap are better strategies. It depends on the problem at hand

Using mmap with the right flags should let you load and process huge files as well as if they were in ram.

Plus, swap performance is being improved bit by bit, so it's not as much of a dirty word as it was before.
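To illustrate the mmap route with numpy (filename and sizes are made up):

```python
import numpy as np

# Write ~8 MB of float64s to disk, then sum them via a memory map.
# The OS faults pages in on demand instead of loading the whole file,
# so this works the same way for files far larger than RAM.
np.arange(1_000_000, dtype=np.float64).tofile("huge.bin")

data = np.memmap("huge.bin", dtype=np.float64, mode="r")
total = float(data.sum())
```

mode="r" maps the file read-only, so nothing is ever written back to swap or disk.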


This! It's simple, performant depending on application and costs zero in extra development.

The "spin up a big data cluster" bit at the beginning seems like either a straw man or just oddly out of touch. Who, when determining how to process something on the order of a 100gb file, even considered something like that?

Also, SQLite is a first class citizen in this space. Most if not all languages can easily load data to it, virtually any language used for analysis can easily read from it, and it's file-based so there's no reason to spin up a server. Finding 100GB on disk is much easier than 100GB in ram.
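To make the SQLite route concrete, the stdlib module is enough; the schema here is invented for illustration:

```python
import sqlite3

# Stage rows in an on-disk database, then let SQL do the aggregation
# so only the (small) result set has to live in memory.
con = sqlite3.connect("events.db")
con.execute("CREATE TABLE IF NOT EXISTS events (user TEXT, amount REAL)")
con.executemany("INSERT INTO events VALUES (?, ?)",
                [("alice", 1.0), ("alice", 2.0), ("bob", 5.0)])
con.commit()

totals = con.execute("SELECT user, SUM(amount) FROM events "
                     "GROUP BY user ORDER BY 2 DESC").fetchall()
con.close()
```

Loading is a one-time cost; after that, indexed queries over a 100GB file are cheap.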

Divide that file size by 10, and you're still in the range where I've had to argue that rolling out Spark is overkill.

People will disbelieve that, but it's absolutely true... I'll never forget the interview I did with an engineer who described an elaborate Hadoop-based solution to some past problem. When I asked him what type of data he was working with, he said, "Here, I'll show you," then whipped out his laptop and showed me a spreadsheet. It wasn't an extract of the data. It was literally a spreadsheet, manageable on a laptop, that he somehow decided needed a Hadoop cluster to process. (Also, who shows data from your current employer to a new prospective employer? Weird but true.)

I had an interesting experience a while back where it came to light that I was working on the same problem as another team in the org (this was a huge multinational), so a meeting was arranged so we could compare notes. The other team was slightly shocked to see that I could train a model in a minute or two, where it took them an hour or two using essentially the same algorithm.

They insisted that shouldn't be, because I was doing it on my laptop and they were using a high performance computing cluster. They of course wanted to know how my implementation could be so much faster despite running on only a single machine. I didn't have the heart to suggest that maybe it was because, not despite.

Ironically, I also got the implementation done in a lot fewer person-hours. I just did a straight code-up of the algorithm in the paper, where they had to do a bunch of extra work to figure out how to adapt it to scale-out.

This isn't to say that big data doesn't happen. Just that it's a bit like sex in high school: People talk about it a lot more than they actually have it, perhaps because everyone's afraid their friends will find out they don't have it.

What could be the reasons he wanted to use a cluster to process that spreadsheet? Just curious what was the problem they had.

Big ups for SQLite too. It's extremely easy to plug in (esp. if you're using Python) and rather performant. An added bonus is that you can run a simple query in SQL for free. It's just as easy as awk. If you do small/medium size of data processing (say, less than 100G), SQLite is a must tool to learn.

Many many people have proposed exactly that, especially just a few years ago when (not really) "big" data was a big fad and 100GB PCIe SSD cost more than the coins in the couch cushions.

I usually just use local spark for some large file that pandas cannot handle well. It's not perfect but it works for me.

I recently ran into an issue in a game where my team and I were storing hundreds of 160-byte config structures in memory. It blew up to over 86kb of data (which doesn't seem like much, but we're already pressed for memory at a 64mb-limit). I compressed the configs by packing the structure into a 2-byte bitfield, simply twiddling bits to get/set each parameter, and just like that the 86 kilobyte configs dropped to just over 1kb. We're still trying to reduce memory elsewhere but the savings showed we can do similar tricks to significantly save our memory footprint in other places.
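The same trick is easy to sketch in any language; here's a hypothetical four-field config packed into 16 bits (the field layout is invented, not the game's actual one):

```python
# Layout: kind:4 bits | level:3 | flags:5 | speed:4  (16 bits total)
def pack(kind, level, flags, speed):
    return kind | (level << 4) | (flags << 7) | (speed << 12)

def unpack(word):
    return (word & 0xF, (word >> 4) & 0x7,
            (word >> 7) & 0x1F, (word >> 12) & 0xF)
```

Each field gets only as many bits as its value range needs, which is where the ~80x saving comes from.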

Another key technique I've used is to use pipes and basic command line tools where possible to pre- or postprocess your data. For example a `sort | uniq -c | sort -nr | head` pipeline to get only the most frequently occurring lines works in a few kilobytes of ram no matter how big the input is. Combine this with chunking and you can get a lot of data processed in manageable amounts of memory.

Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.
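In Python the whole pipeline fits in a short function (hypothetical helper; note that unlike the shell version, whose sort spills to temp files, this holds one counter per distinct line):

```python
import lzma
from collections import Counter

def top_lines(path, n=10):
    """Rough equivalent of `xzcat f | sort | uniq -c | sort -nr | head`."""
    counts = Counter()
    with lzma.open(path, "rt") as f:
        for line in f:               # decompressed lazily, line by line
            counts[line.rstrip("\n")] += 1
    return counts.most_common(n)

# demo: write a small .xz file and summarize it
with lzma.open("log.xz", "wt") as f:
    f.write("GET /\nGET /about\nGET /\n")
top2 = top_lines("log.xz", 2)
```

lzma.open streams both ways, so neither the compressed nor the decompressed data is ever fully in memory.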

> Of course the next problem is "my data doesn't fit on disk", but with xzcat on the command line and lzma.open in python you can work transparently with compressed files.

I usually use LZ4 compression for this purpose, because it's ridiculously fast (700 MB/s per core compression and 5 GB/s decompression). Sure, compression ratios aren't as good, but usually good enough.

Completely agree (I do this myself all the time), but keep in mind the initial sort does not have a constant cost in terms of memory.

Yet, one realizes how well written these basic tools are only after having some bruises with fancier tools.

I don't know what POSIX says on the matter, but at least GNU sort uses some variation of merge sort with temporary files and has for all intents and purposes constant memory use.

Compression alleviates the problems "my current disks don't fit my data" and "my data is too big for my disk quota". It's comparatively bad at solving the problem "I cannot buy enough disks to fit my data".

zstd is nearly always superior to xz, fyi.

Go makes solving problems like this an absolute joy. I recently ran into a situation where I had to download and extract a very large tar archive, then send its contents to another network resource. The device doing the extraction had severely restricted memory and storage space.

Since gzipped tar archives contain sequential data, solving it ended up being trivial. With Go, I was able to string together the file download, tar extraction, and forwarding the individual files on, all within the context of a single stream.

How is that better than

    curl ... | tar xv | ...
in shell, or any programming language with a good streaming library?

It's not. It's not a novel concept. But it's very clean, readable code in Go.
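For what it's worth, the same single-pass pipeline is also expressible in Python with `tarfile`'s streaming mode: "r|gz" forbids seeking, so nothing beyond the current member needs to be buffered. A sketch:

```python
import tarfile

def stream_tar_members(fileobj):
    # fileobj can be any readable, non-seekable stream, such as an
    # HTTP response body. "r|gz" is tarfile's streaming (no-seek) mode.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tf:
        for member in tf:
            if member.isfile():
                # In streaming mode, extractfile() is only valid while
                # this member is the current one.
                yield member.name, tf.extractfile(member).read()
```

Note that `.read()` still pulls each member fully into memory; on a severely constrained device you'd copy each member onward in fixed-size chunks instead.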

Dask, anyone? It allowed me to handle a 60 GB dataset on a machine with 16 GB of RAM. High level of compatibility with Pandas dataframes.

I'm a Dask evangelist. It's a remarkable tool and is one of the first I reach for when this problem arises. Maybe it's not well known?

I mean, don't reach for Spark. Dask, a database, or R's disk.frame is mostly fine.

There's a lot of room between a laptop with 16GB of RAM and a "Big Data cluster" - in my experience the easiest solution is to spin up a VM in GCP and crank the RAM way up.

It blew my mind when I realized I could add / remove RAM by turning off the instance, dragging a slider, then turning it back on.

I also find h5py really useful for creating massive numpy arrays on disk that are too large to fit in memory. I used it to precompute CNN features for a video classification model (much faster than computing on each gradient descent pass) and it makes it easy to read/write parts of a numpy array when the entire numpy array is too big to fit in memory.

Besides compression, is there an advantage to using h5py over numpy's memmap?
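For reference, the memmap side of that comparison is just a raw binary file exposed as an ndarray; a small sketch (hypothetical helper name):

```python
import numpy as np

def memmap_roundtrip(path, shape=(1000, 64)):
    # np.memmap maps a raw binary file as an ndarray; pages are
    # faulted in by the OS on access, so the array can be much
    # larger than RAM.
    arr = np.memmap(path, dtype="float32", mode="w+", shape=shape)
    arr[0, :] = 1.0  # writes go straight to the mapped file
    arr.flush()
    ro = np.memmap(path, dtype="float32", mode="r", shape=shape)
    return float(ro[0].sum())
```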

Dumb question, but wouldn't memory mapping or lazy/streaming processing (e.g. SAX for parsing large XML documents, Java Streams for handling large amounts of "things", memory mapped files for anything you don't need intermediate representations for) resolve this problem as well?
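For the XML case, `xml.etree.ElementTree.iterparse` in the stdlib gives SAX-like constant memory use if you clear elements as you go. A sketch with a hypothetical `count_items` helper:

```python
import xml.etree.ElementTree as ET

def count_items(path, tag):
    n = 0
    # iterparse yields elements as their end tags are seen;
    # clearing each one keeps the in-memory tree from growing.
    for _event, elem in ET.iterparse(path):
        if elem.tag == tag:
            n += 1
        elem.clear()
    return n
```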

See above about streaming (chunking is basically the same thing).

mmaping is a form of uncontrolled chunking, where chunk size and location is determined by the operating system's filesystem caching policy, readahead heuristics, and the like. As a result, it can sometimes have much worse performance than an explicit chunking strategy, especially if you just treat it as magic.

Or to put it another way, mmaping is helpful but you still need to understand why it might help and how to use it.

> it can sometimes have much worse performance than an explicit chunking strategy

Can you expand on this?

I haven’t heard this before.

Edit: Found a comprehensive discussion on this https://stackoverflow.com/questions/45972/mmap-vs-reading-bl...

I would just cat | grep | awk | sed on this stuff rather than resorting to Py.

Awk pipelines are exactly the area where Mario is useful for Python programmers.


Just want to compliment how well written this article is. A lot of technical articles lack the context or basic background (“why”) and this article does a great job.

Why / Intent is the biggest overlooked aspect of computer science / computer engineering imo.

I just wish more technical documentation had a human element of “what we intended” to it.

Fun fact for non CS people, which this article seems to address.

Nothing fits: disk to RAM, RAM to L3, L3 to L2, L2 to L1, L1 to registers. We are just lucky that many programs have spatial and temporal locality.

It looks like disk and RAM will merge soon into some form of insanely fast and persistent storage.

And L3 is already at 256MB (+32MB L2) in the new AMD Threadripper CPUs.

mmap() anyone?

Seriously, this should be the top of the list and default solution when you need to process large file that does not fit in memory.

The exceptions would be when you can read the file once (then a stream might be easier, but not cheaper), when you don't want to rely on the FS cache (you need your own), or when you need easy interoperability between different OSes.

I came to the comments after reading the article just to find the person suggesting mmap and upvote them :)

Exactly! Me too :D

It is quite saddening to me that so few developers try to take advantage of the tools available.

It is a really magical thing to be able to run just a single function and have your entire terabyte file suddenly present itself as an addressable, contiguous memory space ripe for random access.

Regular random access file I/O seems silly to me, like trying to work with the file through a key hole.

Have you ever seen how ships are built inside a bottle? That's exactly the picture I have in my mind...

Use Dask. In R for tabular data use disk.frame

I think the author completely misses mentioning that sequential access to cold storage is nowhere near as slow, relative to RAM, as random access is.

Cold storage loses big only in scenarios where you have completely random seek patterns, and there are lots of ways to optimize that in read- and write-heavy workloads. This was the art of running performant multi-terabyte DBs in the world before SSDs and ramdisks.

For very big index walks, you want the data to be flatter and the seeks to be sorted, so there is a higher chance that needed records are accessed without page eviction. Modern DBs, I believe, do something like that internally.

And for write-heavy loads, there is no alternative to revisiting "data structures 101." You can reduce the disk load many times over with a properly picked tree, graph, or log structure.

Just wanted to add that tree and graph structures can of course be stored sequentially on disk or a flash drive (log-structured). For instance, with really fast random-access storage and some additional space you can even save the history, that is, lightweight snapshots of your data: you always append data.

This way you can also use clever versioning algorithms to predict the performance (no read or write peaks) and to lower the space consumption.

Do you have a list of required reading?

One can efficiently stream data even when there's a need to combine multiple streams. I oftentimes have data sorted on its ID, for instance (usually when it comes from a database), and then can easily join and group by this ID (the same way `uniq` would).

All using Python's standard library, here's a quick post I wrote about this. I last used it a week ago and got from 950 MB RAM usage down to about 45 megs. https://kokes.github.io/blog/2018/11/25/merging-streams-pyth...
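The core of that approach condenses to a few stdlib calls: `heapq.merge` lazily merges the ID-sorted streams, then `itertools.groupby` groups by ID with only one group in memory at a time. A sketch:

```python
import heapq
from itertools import groupby
from operator import itemgetter

def merge_group(streams, key=itemgetter(0)):
    # streams: iterables of rows, each already sorted by `key`.
    merged = heapq.merge(*streams, key=key)
    for k, rows in groupby(merged, key=key):
        yield k, list(rows)  # only the current group is materialized
```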

Want to compress and index your data at the same time? Use roaring bitmaps! A bitmap representation of data is surprisingly flexible for computation, and roaring uses some clever tricks to do that with high performance. For additional scalability, try pilosa, a distributed database built on top of roaring (I work for the company that maintains pilosa).

https://roaringbitmap.org/ https://www.pilosa.com/

I'm working on storing data in a log-structure persistently without the overhead of a transaction log in my spare time (Open Source[1]). Instead, an UberPage which guarantees consistency is atomically swapped during a transaction commit (inspired by ZFS). It should be used for data, which doesn't fit into main memory. Revisions are always appended, whereas almost only changed data plus some metadata is written to a file during commits.

You can easily retain the full version history in a log structure, but you need fast random access (option of parallel "real" access would be best) on a flash drive -- PCIe SSDs for instance.

Basically in order to balance read and write performance only a fraction of each database page (with changed records) needs to be written sequentially in batches to the end of a file.

Each revision is indexed under a RevisionRootPage and these are indexed under the UberPage with keyed tries.

This opens up a lot of opportunities for analysing data and its history.

[1] https://github.com/sirixdb/sirix

Ah yes. A problem they solved 50 years ago

We are about 5 years out from that being true for pretty much everything. Most of the problems we fight over were identified by the early 70's.

I wrote a library to deal with this problem when it comes to NumPy arrays (e.g. images). Could not fit all images in RAM, so I went with a constant-memory approach[0].

The reduction in memory has allowed us to do parallel processing, for a pretty significant speed-up!


Binary serialization and real-time compression like zstd have opened up a whole new world for me. Something like flatbuffer even lets you index into structured data without a deserialization penalty.

I’m also a fan of serialization that allows for streaming and in-memory access without fully deserializing. The project I’m working on uses CBOR and it’s been working really well. Not using the streaming, but being able to quickly access without transforming the data has been huge.

The article mentions "Rent a VM in the cloud, with 64 cores and 432GB RAM, for $3.62/hour." Does anyone have specific services they'd recommend?

I'm working on a new platform to easily deploy compute/memory-intensive applications to cloud API endpoints [1]. This lets you share your app with others (e.g. colleagues) and makes the most sense if your task needs a lot of memory/CPU but not too much IO. Contact info is in my profile and I'm happy to discuss your use case!

[1] https://www.explorablelabs.com

You mean like terraform using predefined ami >_>?

Plenty of options in the price range/ spec range for AWS: https://aws.amazon.com/ec2/pricing/on-demand/

If you sign up by month it will be much cheaper too.

Since this is an article about speed and data processing on a Python blog, I'll just point out that the RiseLab team that created Ray and RLlib also has a library called Modin, which is distributed Pandas. You just import modin.pandas as pd without changing the rest of the code: https://github.com/modin-project/modin

From my reading, however, it will convert the data to an actual pandas dataframe for certain operations, which is presumably bad if your data doesn't fit into memory.

and in that vein, Dask: https://dask.org/

I'm currently running an index that requires over 64GB of RAM. There's no possibility of lowering this, but I don't have a really high requirement for latency. So my solution was to get an M2 storage instance and create a 100GB swap space. Works like a charm, and if I need lower latency in the future I just change the instance to something that has enough RAM.

Another one: reservoir sampling. It's used in Deep CFR, a form of CFR that uses neural networks. CFR is an algorithm used for finding equilibrium strategies in imperfect information games, most notably poker.
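The classic version (Vitter's Algorithm R) fits in a few lines: it keeps a uniform sample of k items from a stream of unknown length using O(k) memory. A sketch:

```python
import random

def reservoir_sample(stream, k, rng=random):
    # Algorithm R: after seeing i items, each item has probability
    # k/i of being in the sample; memory use is O(k).
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)
        else:
            j = rng.randint(0, i)  # inclusive bounds
            if j < k:
                sample[j] = item
    return sample
```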


I'm closely following the development of disk.frame, an on-disk alternative to data.frame that uses the same API.

A few years ago, I found it surprisingly difficult to sort a file that didn't fit in RAM in Python; I couldn't find any simple libraries for this kind of stuff.

Disappointingly, I ended up writing my data to a text file, sorting with unix sort (w/ some tuning on --parallel and --buffer-size) and reading it back.

> Disappointingly, I ended up writing my data to a text file, sorting with unix sort (w/ some tuning on --parallel and --buffer-size) and reading it back.

One of my biggest beefs with Unix is that so much powerful functionality is locked up in command line utilities and not accessible as libraries.

There's an enormous amount of literature and technique in the mainframe world for dealing with data bigger than RAM, because that was the norm.

Whole companies have been built around the problem. e.g. syncsort.com

I hope some interesting techniques don't get locked up and lost in proprietary mainframe software.

Step 1: use a generator instead of building up a list https://realpython.com/introduction-to-python-generators/
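i.e. something like this sketch, where records are consumed lazily instead of being materialized as a list (hypothetical CSV helpers):

```python
def read_records(path):
    # A generator: yields one parsed record at a time, so only a
    # single line is ever held in memory.
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n").split(",")

def total(path, col):
    # sum() pulls records from the generator one by one.
    return sum(float(rec[col]) for rec in read_records(path))
```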

> When your data doesn’t fit in memory: the basic techniques

Uhm... Isn't it called a database?

Jokes aside, one of the core use cases of things like databases is random access to a part of data that's too large to fit in RAM.

Hi, once upon a time, I worked with medical image data. I literally had a DBA suggest storing the volumetric data in a database. I worked diligently to explain to him why he was incorrect. When he resisted my explanations and tried to push forward with his plan, I worked diligently to get him fired. I was successful.

For those of us who don’t work with medical image data, can you share why he was incorrect?

I realize that relational databases are not the right box to fit certain kinds of data into, but you have to put your data somewhere that allows it to be efficiently manipulated. What is that if not a “data base”?

Medical imaging data is typically dense 3D grids of density samples. A relational database could hold a pointer to the file, but not the 5GB-500GB dataset files. One dataset is typically a folder of various files, metadata, and other information representing an entire filesystem. Back in the day, the datasets would be RAIDed across several machines. Now they're virtual filesystems. The database could manage this info, but not hold the actual dataset itself. I imagine this is what the commenter means. The DBA probably said "it all needs to be in the database", and that just doesn't make sense.

Holds true for lots of data varieties; the database maintains links to compressed archives.

Let me try to make a metaphor for you. Imagine that you create HTML content, and you're considering sharing it with the world. You propose to create a "web server". The DBA next to you says you should just use MySQL. Stunned, you try to explain that your proposed "web server" will communicate using this new "HyperText Transfer Protocol." The DBA counters that SQL is fine. You try to explain about dynamic content and PHP, etc. The DBA counters that SQL stored procedures can do all of that.

Next thing you know, grandma is typing:


That's just an analogy, of course. But perhaps you can imagine your own list of reasons why MySQL didn't replace Apache.

The answer to your question is that we could not possibly use any off the shelf software to provide the interfaces we needed (we were writing our own). By the time we had bytes that we needed to store somewhere, our data was about a GB of complicated structure that we accessed as memory-mapped files. (Look at https://capnproto.org/ for a rough analogy to the kind of access we needed.)

And what I didn't say was that the DBA was literally recommending MySQL BLOBs, storing a bit more than 0.5 MB (512 * 512 * 2 bytes of image data) in each row, having a thousand rows or more per CT scan. The performance of that would have been absolute crap. It made literally zero sense.

Short version is that a medical image, like a typical MRI, is a 3D picture at least 1024 x 1024 x 1024 pixels, meaning each image is multiple gigabytes. SQL databases don't do well at storing huge binary blobs like that.

Right, but SQL databases aren't the only kind of database. I gather that the guy in the original anecdote was referring to a relational database specifically, but not all databases are relational. The filesystem is a perfectly good hierarchical document-based database, and it's worth thinking about it that way in both theory and in practice.

Stuff like this, back in the day: a RAIDed filesystem. Nowadays: a blob object store (like S3 or Azure Blob Store) or distributed FS such as HDFS

I'm sure you are aware, but some databases have specific extensions and technology for this. See https://en.wikipedia.org/wiki/Spatial_database (though that tends to focus on GIS stuff).

However, if you are simply "pulling blobs up" of compressed (or uncompressed) volumetric data... well yeah. Don't put that in a sql database.

My assumption was that the system had finite ram available, whether that's serialized in a database or read directly into the program.

So? Typically "database" implies a product like SQLite or Postgres. These handle many multiples of RAM size.

In-memory databases are a thing, but that's a very specialized use case.

A somewhat surprising fact about most PostgreSQL deployments is that most of the frequently accessed tuples will fit into RAM, and it handles this use case really well (to the extent that people who don't know any better see no issue with LIKE ‘%whatever%’ over a 100M-tuple table; it just takes 1.5s, so what...)

Pretty much all SQL DBs support pagination, and many libraries that abstract the DB away hide this as an iterator.

There is no simple solution, because any compression/decompression costs you computation. If data is stored uncompressed, it takes more space, but access takes less time than if it were stored compressed.

If your data needs sorting and does not fit in memory: https://github.com/rmuchev/ExternalSort

Just wanted to chime in to say that the TensorFlow 2.0 Dataset library is nice for this sort of problem. It supports batching, caching, remote fetching, splitting work across multiple CPU cores, etc.

When your data doesn't fit in memory: SQLite

Yeah for me this was also the obvious solution when I had to extract information from ~80 GB of tabular data. It's not the most trendy tool, but it did the job extremely well and with little effort. Just be sure to only create an index after having inserted all the data...

Correct. Drop indexes for inserts and recreate after bulk inserts.

Also, another hint: process all inserts in the batch in a single transaction. I was amazed at the speed at which SQLite ingests data! Speed demon! (Don't forget PRAGMA synchronous OFF too.)
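Putting those hints together with the stdlib `sqlite3` module (a sketch with hypothetical table and column names; `PRAGMA synchronous = OFF` trades crash safety for speed):

```python
import sqlite3

def bulk_load(db_path, rows):
    con = sqlite3.connect(db_path)
    con.execute("PRAGMA synchronous = OFF")  # faster, less crash-safe
    con.execute("CREATE TABLE IF NOT EXISTS t (k TEXT, v INTEGER)")
    with con:  # one transaction for the whole batch
        con.executemany("INSERT INTO t VALUES (?, ?)", rows)
    # Create the index only after the bulk insert.
    con.execute("CREATE INDEX IF NOT EXISTS t_k ON t (k)")
    con.commit()
    con.close()
```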

mmap can really help with this too. It won't stop you from being silly and using random memory access patterns, but it does abstract away the file access as a piece of address space. Short of running on a 32-bit machine, this can help a lot. Even on a 32-bit machine you can abstract it away with another layer and window the access.

Somebody needs to tell this guy about a thing called a "database".

Love these kinds of problems... is chunking the same as streaming in this context?

Odd not to see any reference to mmap, which is part of the python library.
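For completeness, a minimal sketch of the stdlib version: the file is exposed as a bytes-like object, and the OS pages it in as it's touched.

```python
import mmap

def count_newlines(path):
    # Note: mmap cannot map an empty file, so this assumes the
    # file is non-empty.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count, pos = 0, 0
            while True:
                pos = mm.find(b"\n", pos)
                if pos == -1:
                    return count
                count += 1
                pos += 1
```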

You can spin up a standalone Spark cluster in about an hour and it's pretty easy.

And yes you will need to rewrite a small amount of your code but you're doing so here as well.

At least you will be able to scale out to much larger volumes in a consistent way.

For many tasks a Spark cluster is overkill. Streaming will get you a really long way before you have to tackle running a whole Spark cluster. And you'll probably still be able to do it on your laptop.

You can't just apply a streaming paradigm to a batch style workload.

And 90% of all tasks are batch oriented.

Batch is just streaming in really large chunks. And besides, if you are messing around with data on your laptop, you probably aren't at the stage of working out a batch workload yet. Jumping immediately to "this needs to be a batch workload" before you've had a chance to play with it is probably overkill.
