
When your data doesn’t fit in memory: the basic techniques - itamarst
https://pythonspeed.com/articles/data-doesnt-fit-in-memory/
======
userbinator
Unusual to have no occurrence of the word "stream" anywhere in the article...
given that's the usual term for algorithms which require a constant or almost-
constant space regardless of the amount of data they process.

A semi-common beginning programmer's exercise is to write a program that
numbers the lines in a text file. The naive solution will use O(n) space,
while a bit more thought reveals that this can be done in constant (to be
really precise, O(log n) where n is the number of lines in the file) space.
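For concreteness, a minimal streaming version in Python (file names are placeholders): only the current line and a counter are ever held in memory.

    with open("input.txt") as src, open("numbered.txt", "w") as dst:
        for i, line in enumerate(src, start=1):
            dst.write(f"{i}\t{line}")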

~~~
itamarst
Chunking is the term I used because that's more relevant to the data science
domain I'm focusing on here (e.g. Pandas has "chunksize", Zarr has "chunks").
Streaming has some implication of an ongoing stream of data to me... but I
ought to clarify some of the assumptions about a fixed size of data, yes.
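For example, a minimal chunked aggregation with Pandas (the file name and column are made up for illustration):

    import pandas as pd

    total = 0
    # chunksize makes read_csv yield DataFrames of at most 1M rows each,
    # so only one chunk is in memory at a time
    for chunk in pd.read_csv("big.csv", chunksize=1_000_000):
        total += chunk["amount"].sum()
    print(total)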

~~~
dmurray
Chunking and streaming are different things to me. Chunking means you get to
process multiple rows of data at the same time, usually useful to take
advantage of SIMD. Streaming means that the data is accessed in a single pass:
once you compute the effect of a given row on your statistics, you never have
to rewind to see it again.

Many modern performant solutions will use both, but they're not the same
thing.

~~~
nine_k
One important application of chunking is efficient I/O.

Mass storage is most efficient when doing large sequential reads and writes,
so you normally feed your constant-space streaming algorithms from buffers
with a large number of input records.

Sometimes you can just tell the OS to do efficient chunked prefetching for you.
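A rough sketch of both ideas in Python (the fadvise hint is Unix-only; the file name is hypothetical):

    import os

    CHUNK = 8 * 1024 * 1024  # large reads keep the disk doing sequential I/O

    with open("records.bin", "rb") as f:
        # hint that we'll read sequentially so the kernel can prefetch aggressively
        os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_SEQUENTIAL)
        while True:
            buf = f.read(CHUNK)
            if not buf:
                break
            # feed buf to a constant-space streaming routine here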

~~~
hofstee
If you're streaming in a language like Python, its IO will be doing some
degree of chunking behind the scenes. It might be beneficial to do more
manually.

------
RcouF1uZ4gsC
> The easiest solution to not having enough RAM is to throw money at the
> problem.

People underestimate how much memory modern computers can actually support if
you max them out.

[http://www.itu.dk/people/jovt/fitinram/](http://www.itu.dk/people/jovt/fitinram/)

~~~
TuringNYC
I once had a job interview where they gave me a problem and expected some
answer of "use a spark cluster, da da" and I said -- well given the problem
specs and upper bounds you've given, I'd throw a $200 stick of RAM at it and
be done.

They wanted some answer involving using spark engineers you'd effectively pay
for at $200/hr for many weeks. I didn't get the job.

OK, but seriously speaking, if the upper bound is beyond 768GB of RAM or whatever
the current max is, I might want to use dask distributed.
[https://distributed.dask.org/en/latest/](https://distributed.dask.org/en/latest/)

edit: wrote mb instead of gb

~~~
mumblemumble
It's slightly mystifying. The only company I've worked at that did "big data"
_really_ well just plugged a few TB of RAM into some sharded databases and got
on with life.

Usually when I tell that story, I get a lot of objections about how that
solution won't scale and they must not have _really_ had big data from people
who are, truth be told, used to working with data on a fraction of the scale
that this company did.

That said, it's not a turnkey solution. This company also was more meticulous
about data engineering than others, and that certainly had its own cost.

~~~
RivieraKid
I've always disliked the term "big data" because all of the attempts at a
definition seemed either stupid or vague. After a while, I came up with this
definition: it's a set of technologies used for processing data that is too
large to be processed on a single machine.

~~~
mumblemumble
The thing that gets me about that definition is that "too large to be processed
on a single machine" leaves out a lot of variables. How's the machine specced?
How's it being analyzed? Using what kinds of software?

If the only single-machine option you consider is Pandas, which doesn't do
streaming well and is built on a platform that makes ad-hoc multiprocessing a
chore, you'll hit the ceiling a _lot_ faster than if you had done it in Java,
which might in turn be hard to push as far as something like C# (largely
comparable to Java, but some platform features make it easier to be frugal
with memory and mind your cache lines) or, dare I say it, something native
like ocaml or C++.

Alternatively, if you start right off with Spark, you won't be able to push
even one node as far as if you hadn't, because Spark is designed from the
ground up for running on a cluster, and therefore has a tendency to manage
memory the same way a 22-year-old professional basketball player handles
money. It makes scale-out something of a self-fulfilling prophecy.

Also, as someone who was doing distributed data processing pipelines well
before Hadoop and friends came along, I'm not sure I can swallow "big data"
being one and the same as "handling data that is too big to run on one
computer." Big data sort of implies a certain culture of handling data at that
scale, too.

Because of that, I tend to think of "big data" as describing a culture as much
as it describes anything practical. It's a set (not the only set) of
technologies for processing data on multiple machines. Whether you actually
need multiple machines to do the job seems to be less relevant than the
marketing team at IBM (to pick an easy punching bag) would have us believe.

~~~
toast0
Saying big data is data too large to process on a single machine purposefully
leaves out the spec of the machine.

That's because a reasonably sized machine from today is much larger than one
from five years ago. And an unreasonably large machine today is also larger,
yet more achievable.

A basic dual Epyc system can have 128 cores, and 2TB of ram. Someone mentioned
24 TB of ram, which is probably not a two socket system.

You can do a lot with 2TB of ram.

~~~
e12e
And there are still _some_ use cases beyond the single machine: eg CERN.

But I think it's quite safe to say that it's not often because you need to
process so much data, but rather that your experiment is a fire hose of data,
and you're not sure what you want to keep, and what you can summarize - until
_after_ you've looked at the data.

And there might be a reason to keep an archive of the raw data as well.

Another common use case would be seismic data from geological/oil surveys.

But "human generated" data, where you're doing some kind of precise, high
value recording, like click streams, card transactions etc might be "dense",
but usually quite small compared to such "real world sampling".

------
jacquesm
I'm missing the most obvious one: check to see if you need all that data.
Plenty of times you can get 98% of the quality with a small fraction of the
available data. Doing a couple of tests to determine what constitutes a
representative sample could save you a ton of time _and_ a ton of money.

~~~
ImaCake
I spent a few weeks on-and-off between more important study, trying to help a
friend process their sequencing data in a bespoke way. I kept trying to write
an algorithm that would take about 500 hours to run. I first used Pandas
(which is kinda slow) and then tried base python dict() and iterables.
Eventually I realised that most of the data was actually redundant if I just
counted how many instances of each unique row were in my dataset and then just
threw away the extras while keeping the counts for later. My new algorithm did
what I wanted in 8 seconds flat and was only a tiny bit more effort to
integrate with the counts.

My lesson learnt was that I just needed to reduce how much data I was working
with first, instead of trying to stuff a multi-gb file into memory!
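Something along these lines, assuming pandas >= 1.1 and a hypothetical file and separator; only the per-row counts survive each chunk:

    import pandas as pd

    counts = None
    for chunk in pd.read_csv("reads.tsv", sep="\t", chunksize=500_000):
        c = chunk.value_counts()  # counts of identical rows within this chunk
        counts = c if counts is None else counts.add(c, fill_value=0)

    # the handful of distinct rows plus their counts is all that's kept
    print(counts.sort_values(ascending=False).head())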

------
bmer
I am surprised there is no mention of hdf5 (via h5py). It's what I used to
deal with handling simulation data that was getting too big for RAM to handle.

Use case: CPU is running a simulation and constantly spitting out data,
eventually RAM is not enough to hold said data. Solution: every N simulation
steps, flush the data in RAM to disk, and then continue simulating. N must be
chosen judiciously so as to balance the time cost of writing to disk (you don't
want to do it too often).

I figure this is what is referred to as "chunking" in the article? Why not
list some packages that can help one chunk?

Overall opinions on this method? Could it be done better?

~~~
CreRecombinase
This is exactly what HDF5 was built for. Figuring out how often to persist to
disk is going to depend on a number of factors. If you want to get fancy you
can dedicate a thread to I/O but that gets hairy quickly. You also might want
to look into fast compression filters like blosc/lzf as a way of spending down
some of your surplus CPU budget.
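A minimal h5py sketch of that pattern (sizes, names and the "simulation" step are stand-ins): a resizable, chunked, lzf-compressed dataset that gets a batch of rows appended every N steps.

    import numpy as np
    import h5py

    STATE_SIZE = 1000   # size of one simulation snapshot (made up)
    FLUSH_EVERY = 100   # "N": trades write frequency against RAM use

    with h5py.File("sim.h5", "w") as f:
        dset = f.create_dataset(
            "trajectory",
            shape=(0, STATE_SIZE),
            maxshape=(None, STATE_SIZE),   # resizable along the time axis
            chunks=(FLUSH_EVERY, STATE_SIZE),
            compression="lzf",             # cheap compression, as suggested above
        )
        buffer = []
        for step in range(10_000):
            state = np.random.rand(STATE_SIZE)   # stand-in for one sim step
            buffer.append(state)
            if len(buffer) == FLUSH_EVERY:
                dset.resize(dset.shape[0] + FLUSH_EVERY, axis=0)
                dset[-FLUSH_EVERY:] = np.asarray(buffer)
                buffer.clear()
        # (a real run would also flush any leftover rows here)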

------
adrianmonk
Also look at whether you can solve your problem with a different algorithm,
one that is more friendly to external storage.

One classic example is sorting. When your data is too big for RAM, quicksort's
performance is horrible and mergesort's performance is fine (even if your data
is on magnetic tape).

Another classic example is taking a situation where you build a hash (or
dictionary) and using sorted lists instead.

Let's say your task is to take some text and put <b></b> tags around a word
but only the first time it occurs. The obvious solution is to scan through the
text, building a hash as you encounter words so you can check if you've seen
it before. Great until your hash doesn't fit in RAM.

The sort-based solution is to scan through the input file, break it into
words, and output { word, byte_offset } tuples into a file. Then sort that
file by word using a stable sort (or both fields as sort key). Now that all
occurrences of each word are grouped together, make a pass through the sorted
data and flag the first occurrence of each word, generating { byte_offset,
word, is_first_occurrence } tuples. Then sort that by byte_offset. Finally you
can make another pass through your input text and basically merge it with your
sorted temp file, and check the is_first_occurrence flags. All of this uses
O(1) RAM.

I believe this is basically what databases do with merge joins, but the point
is you can apply this general type of thinking to your own programs as well.
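A rough sketch of those passes, with in-memory sorts standing in for the external sorts this approach assumes (on truly huge inputs you'd swap in an external merge sort like the one sketched further down the thread), and returning the first-occurrence offsets rather than inserting the tags:

    import re

    def first_occurrence_offsets(path):
        # Pass 1: emit (word, byte_offset) tuples.
        tuples = []
        with open(path, "rb") as f:
            offset = 0
            for line in f:
                for m in re.finditer(rb"\w+", line):
                    tuples.append((m.group(0), offset + m.start()))
                offset += len(line)

        # Pass 2: stable sort by word, keep the first offset of each word.
        tuples.sort(key=lambda t: t[0])
        firsts = []
        prev = None
        for word, off in tuples:
            if word != prev:
                firsts.append(off)
                prev = word

        # Pass 3: sort back into document order; a final merge pass over the
        # original text would insert the <b></b> tags at these offsets.
        firsts.sort()
        return firsts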

~~~
gspetr
I believe this is a problem that CS has already solved:
[http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf](http://dimacs.rutgers.edu/~graham/pubs/papers/cmsoft.pdf)

The Count-Min Sketch is a data structure consisting of a fixed array of
counters.

This is a short and easily accessible paper, which isn't very heavy on math or
obscure notation or concepts, so I would recommend it to anyone with even a
cursory interest in the subject.

Here's an implementation in Python: [https://github.com/barrust/count-min-
sketch/blob/master/pyth...](https://github.com/barrust/count-min-
sketch/blob/master/python/countminsketch.py)

------
zneveu
Another easy solution I didn't see mentioned is to create swap space on Linux.
This obviously isn't the fastest solution, but setting up 128GB of swap space
lets me mindlessly load most datasets into memory without a single code
change.

~~~
olavgg
Adding a 280GB Optane as swap is very efficient and cheap. It is still a lot
slower than RAM, but much, much faster than NVMe SSDs.

~~~
p1esk
Are you talking about Optane NVDIMM or NVME?

------
wooly_bully
The "spin up a big data cluster" bit at the beginning seems like either a
straw man or just oddly out of touch. Who, when determining how to process
something on the order of a 100GB file, even considered something like that?

Also, SQLite is a first class citizen in this space. Most if not all languages
can easily load data to it, virtually any language used for analysis can
easily read from it, and it's file-based so there's no reason to spin up a
server. Finding 100GB on disk is much easier than 100GB in ram.

~~~
mumblemumble
Divide that file size by 10, and you're still in the range where I've had to
argue that rolling out Spark is overkill.

~~~
zetazzed
People will disbelieve that, but it's absolutely true... I'll never forget the
interview I did with an engineer who described an elaborate Hadoop-based
solution to some past problem. When I asked him what type of data he was
working with, he said, "Here, I'll show you," then whipped out his laptop and
showed me a spreadsheet. It wasn't an extract of the data. It was literally a
spreadsheet, manageable on a laptop, that he somehow decided needed a Hadoop
cluster to process. (Also, who shows data from your current employer to a new
prospective employer? Weird but true.)

~~~
mumblemumble
I had an interesting experience a while back where it came to light that I was
working on the same problem as another team in the org (this was a huge
multinational), so a meeting was arranged so we could compare notes. The other
team was slightly shocked to see that I could train a model in a minute or
two, where it took them an hour or two using essentially the same algorithm.

They insisted that shouldn't be, because I was doing it on my laptop and they
were using a high performance computing cluster. They of course wanted to know
how my implementation could be so much faster despite running on only a single
machine. I didn't have the heart to suggest that maybe it was because, not
despite.

Ironically, I also got the implementation done in a lot fewer person-hours. I
just did a straight code-up of the algorithm in the paper, where they had to
do a bunch of extra work to figure out how to adapt it to scale-out.

This isn't to say that big data doesn't happen. Just that it's a bit like sex
in high school: People talk about it a lot more than they actually have it,
perhaps because everyone's afraid their friends will find out they don't have
it.

------
wongarsu
Another key technique I've used is to use pipes and basic command line tools
where possible to pre- or postprocess your data. For example a `sort | uniq -c
| sort -nr | head` pipeline to get only the most frequently occurring lines
works in a few kilobytes of ram no matter how big the input is. Combine this
with chunking and you can get a lot of data processed in manageable amounts of
memory.

Of course the next problem is "my data doesn't fit on disk", but with xzcat on
the command line and lzma.open in python you can work transparently with
compressed files.
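A Python near-equivalent of that pipeline which also reads the compressed file directly (unlike sort's temp-file approach, memory here grows with the number of distinct lines; the file name is made up):

    import lzma
    from collections import Counter

    counts = Counter()
    with lzma.open("logs.xz", "rt") as f:   # decompresses transparently
        for line in f:
            counts[line.rstrip("\n")] += 1

    for line, n in counts.most_common(10):  # ~ sort | uniq -c | sort -nr | head
        print(n, line)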

~~~
muxator
Completely agree (I do this myself all the time), but keep in mind that the
initial sort does not have a constant memory cost.

Yet, one realizes how well written these basic tools are only after having
some bruises with fancier tools.

~~~
wongarsu
I don't know what POSIX says on the matter, but at least GNU sort uses some
variation of merge sort with temporary files and has for all intents and
purposes constant memory use.

------
hamsham
I recently ran into an issue in a game where my team and I were storing
hundreds of 160-byte config structures in memory. It blew up to over 86KB of
data (which doesn't seem like much, but we're already pressed for memory at a
64MB limit). I compressed the configs by packing the structure into a 2-byte
bitfield, simply twiddling bits to get/set each parameter, and just like that
the 86 kilobyte configs dropped to just over 1kb. We're still trying to reduce
memory elsewhere but the savings showed we can do similar tricks to
significantly save our memory footprint in other places.
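A toy illustration of the idea (in Python rather than whatever the game uses; the field names and widths are invented): several small fields packed into one 16-bit value with shifts and masks.

    def pack(difficulty, team, has_shield, speed):
        # difficulty: 4 bits, team: 6 bits, has_shield: 1 bit, speed: 5 bits
        return (difficulty << 12) | (team << 6) | (has_shield << 5) | speed

    def unpack(packed):
        return (
            (packed >> 12) & 0xF,
            (packed >> 6) & 0x3F,
            (packed >> 5) & 0x1,
            packed & 0x1F,
        )

    assert unpack(pack(7, 42, 1, 19)) == (7, 42, 1, 19)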

------
jniedrauer
Go makes solving problems like this an absolute joy. I recently ran into a
situation where I had to download and extract a very large tar archive, then
send its contents to another network resource. The device doing the extraction
had severely restricted memory and storage space.

Since gzipped tar archives contain sequential data, solving it ended up being
trivial. With Go, I was able to string together the file download, tar
extraction, and forwarding the individual files on, all within the context of
a single stream.

~~~
kccqzy
How is that better than

    curl ... | tar xv | ...

in shell, or any programming language with a good streaming library?

~~~
jniedrauer
It's not. It's not a novel concept. But it's very clean, readable code in Go.

------
floki999
Dask anyone? Allowed me to handle a 60GB dataset on a machine with 16GB of RAM.
High level of compatibility with Pandas dataframes.
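For anyone curious, the basic pattern looks something like this (file and column names are placeholders); Dask splits the CSV into partitions and only materializes what each step of the computation needs:

    import dask.dataframe as dd

    df = dd.read_csv("huge-dataset-*.csv")        # lazy, partitioned
    result = df.groupby("customer_id")["amount"].sum().compute()
    print(result)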

~~~
roaur
I'm a Dask evangelist. It's a remarkable tool and is one of the first I reach
for when this problem arises. Maybe it's not well known?

------
alexcnwy
There's a lot of room between a laptop with 16GB of RAM and a "Big Data
cluster" - in my experience the easiest solution is to spin up a VM in GCP
and crank the RAM way up.

It blew my mind when I realized I could add / remove RAM by turning off the
instance, dragging a slider, then turning it back on.

I also find h5py really useful for creating massive numpy arrays on disk that
are too large to fit in memory. I used it to precompute CNN features for a
video classification model (much faster than computing on each gradient
descent pass) and it makes it easy to read/write _parts_ of a numpy array when
the entire numpy array is too big to fit in memory.
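A shrunken sketch of that use (all sizes and names are made up): the dataset lives on disk, and only the slice being written or read is ever in memory.

    import numpy as np
    import h5py

    N_FRAMES, N_FEATURES, BATCH = 50_000, 512, 1_000

    with h5py.File("features.h5", "w") as f:
        dset = f.create_dataset("cnn_features", shape=(N_FRAMES, N_FEATURES),
                                dtype="float32")
        for start in range(0, N_FRAMES, BATCH):
            feats = np.zeros((min(BATCH, N_FRAMES - start), N_FEATURES),
                             dtype="float32")   # stand-in for precomputed features
            dset[start:start + len(feats)] = feats   # write just this slice

    with h5py.File("features.h5", "r") as f:
        batch = f["cnn_features"][5_000:5_000 + BATCH]   # read just this slice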

~~~
stilley2
Besides compression, is there an advantage to using h5py over numpy's memmap?

------
Sahonon
Dumb question, but wouldn't memory mapping or lazy/streaming processing (e.g.
SAX for parsing large XML documents, Java Streams for handling large amounts
of "things", memory mapped files for anything you don't need intermediate
representations for) resolve this problem as well?

~~~
itamarst
See above about streaming (chunking is basically the same thing).

mmapping is a form of uncontrolled chunking, where the chunk size and location
are determined by the operating system's filesystem caching policy, readahead
heuristics, and the like. As a result, it can sometimes have much worse
performance than an explicit chunking strategy, especially if you just treat
it as magic.

Or to put it another way, mmapping is helpful but you still need to understand
why it might help and how to use it.

~~~
faizshah
> it can sometimes have much worse performance than an explicit chunking
> strategy

Can you expand on this?

I haven’t heard this before.

Edit: Found a comprehensive discussion on this
[https://stackoverflow.com/questions/45972/mmap-vs-reading-
bl...](https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks)

------
amoron
I would just cat | grep | awk | sed on this stuff rather than resorting to Py.

~~~
j88439h84
Awk pipelines are exactly the area where Mario is useful for Python
programmers.

[https://github.com/python-mario/mario](https://github.com/python-mario/mario)

------
miles_matthias
Just want to compliment how well written this article is. A lot of technical
articles lack the context or basic background (“why”) and this article does a
great job.

~~~
SlowRobotAhead
Why / Intent is the biggest overlooked aspect of computer science / computer
engineering imo.

I just wish more technical documentation had a human element of “what we
intended” to it.

------
onikolas7
Fun fact for non CS people, which this article seems to address.

Nothing fits: disk to RAM, RAM to L3, L3 to L2, L2 to L1, L1 to registers. We
are just lucky that many programs have spatial and temporal locality.

~~~
bufferoverflow
It looks like disk and RAM will merge soon into some form of insanely fast and
persistent storage.

And L3 is already at 256MB (+32MB L2) in the new AMD Threadripper CPUs.

------
lmilcin
mmap() anyone?

Seriously, this should be the top of the list and default solution when you
need to process large file that does not fit in memory.

The exceptions would be when you can read the file once (then a stream might be
easier, but not cheaper), when you don't want to rely on the FS cache (you need
your own), or when you need easy interoperability between different OSes.
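In Python it's basically a one-liner to set up (the file name and the marker searched for are hypothetical); pages are faulted in by the OS only as they're touched:

    import mmap

    with open("huge.bin", "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            header = mm[:16]                # random access near the start
            tail = mm[len(mm) - 16:]        # ...and near the end
            idx = mm.find(b"RECORD")        # search without an explicit read loop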

~~~
jdmoreira
I came to the comments after reading the article just to find the person
suggesting mmap and upvote them :)

~~~
neop1x
Exactly! Me too :D

~~~
lmilcin
It is quite saddening for me that so few developers try to take advantage of
the tools available.

It is a really magical thing to be able to run just a single function and have
your entire terabyte file suddenly present itself as addressable contiguous
memory space ripe for random access.

Regular random-access file I/O seems silly to me, like trying to work with the
file through a keyhole.

Have you ever seen how ships are built inside a bottle? That's exactly the
picture I have in my mind...

------
xiaodai
Use Dask. In R for tabular data use disk.frame

------
baybal2
I think the author completely misses mentioning that sequential access to cold
storage is nowhere near as slow, relative to RAM, as people assume.

Cold storage only loses big in scenarios where you have completely random,
non-sequential seek patterns, and there are lots of ways to optimise that in
read- and write-heavy workloads. This was the art of running performant multi-
terabyte DBs in a world prior to SSDs and ramdisks.

For very big index walks, you want the data to be flatter and the seeks to be
sorted, so there is a higher chance that needed records will be accessed
without page eviction. Modern DBs, I believe, do something like that
internally.

And for write-heavy loads, there is no alternative to revising "data
structures 101." You can reduce the disk load by many times over with a
properly picked tree, graph, or log structure.

~~~
lichtenberger
Just wanted to add that tree and graph structures can of course be stored
sequentially on disk or a flash drive (log-structured). For instance, with
really fast random-access storage and some additional space you can even keep
the history, that is, lightweight snapshots of your data: you always append
data.

This way you can also use clever versioning algorithms to predict the
performance (no read or write peaks) and to lower the space consumption.

------
drej
One can efficiently stream data even when there's a need to combine multiple
streams. I oftentimes have data sorted on its ID, for instance (usually when
it comes from a database), and then can easily join and group by this ID (the
same way `uniq` would).

All using Python's standard library, here's a quick post I wrote about this. I
last used it a week ago and got from 950 MB RAM usage down to about 45 megs.
[https://kokes.github.io/blog/2018/11/25/merging-streams-
pyth...](https://kokes.github.io/blog/2018/11/25/merging-streams-python.html)
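The core of it is just heapq.merge plus itertools.groupby, something like this (file and column names are invented; the inputs must already be sorted by the key):

    import csv
    import heapq
    from itertools import groupby
    from operator import itemgetter

    def rows(path):
        with open(path, newline="") as f:
            yield from csv.DictReader(f)

    streams = [rows(p) for p in ("orders_a.csv", "orders_b.csv")]
    merged = heapq.merge(*streams, key=itemgetter("id"))

    for key, group in groupby(merged, key=itemgetter("id")):
        total = sum(float(r["amount"]) for r in group)
        print(key, total)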

------
alanbernstein
Want to compress and index your data at the same time? Use roaring bitmaps! A
bitmap representation of data is surprisingly flexible for computation, and
roaring uses some clever tricks to do that with high performance. For
additional scalability, try pilosa, a distributed database built on top of
roaring (I work for the company that maintains pilosa).

[https://roaringbitmap.org/](https://roaringbitmap.org/)
[https://www.pilosa.com/](https://www.pilosa.com/)

------
lichtenberger
I'm working on storing data in a log-structure persistently without the
overhead of a transaction log in my spare time (Open Source[1]). Instead, an
UberPage which guarantees consistency is atomically swapped during a
transaction commit (inspired by ZFS). It should be used for data, which
doesn't fit into main memory. Revisions are always appended, and almost only
the changed data plus some metadata is written to the file during commits.

You can easily retain the full version history in a log structure, but you
need fast random access (option of parallel "real" access would be best) on a
flash drive -- PCIe SSDs for instance.

Basically in order to balance read and write performance only a fraction of
each database page (with changed records) needs to be written sequentially in
batches to the end of a file.

Each revision is indexed under a RevisionRootPage and these are indexed under
the UberPage with keyed tries.

This opens up a lot of opportunities for analysing data and its history.

[1] [https://github.com/sirixdb/sirix](https://github.com/sirixdb/sirix)

------
elchief
Ah yes. A problem they solved 50 years ago

~~~
hinkley
We are about 5 years out from that being true for pretty much everything. Most
of the problems we fight over were identified by the early 70's.

------
cosmic_quanta
I wrote a library to deal with this problem when it comes to NumPy arrays
(e.g. images). Could not fit all images in RAM, so I went with a constant-
memory approach[0].

The reduction in memory has allowed us to do parallel processing, for a pretty
significant speed-up!

[0] [https://github.com/LaurentRDC/npstreams](https://github.com/LaurentRDC/npstreams)

------
lllr_finger
Binary serialization and real-time compression like zstd have opened up a
whole new world for me. Something like flatbuffer even lets you index into
structured data without a deserialization penalty.

~~~
SlowRobotAhead
I’m also a fan of serialization that allows for streaming and in-memory access
without fully deserializing. The project I’m working on uses CBOR and it’s
been working really well. Not using the streaming, but being able to quickly
access without transforming the data has been huge.

------
daveslash
The article mentions _Rent a VM in the cloud, with 64 cores and 432GB RAM,
for $3.62/hour_. Does anyone have specific services they'd recommend?

~~~
kathrynloving
I'm working on a new platform to easily deploy compute/memory-intensive
applications to cloud API endpoints [1]. This lets you share your app with
others (e.g. colleagues) and makes the most sense if your task needs a lot of
memory/CPU but not too much IO. Contact info is in my profile and I'm happy to
discuss your use case!

[1] [https://www.explorablelabs.com](https://www.explorablelabs.com)

~~~
pojzon
You mean like terraform using predefined ami >_>?

------
blueyes
Since this is an article about speed and data processing on a Python blog,
I'll just point out that the RiseLab team that created Ray and RLlib also has
a library called Modin, which is distributed Pandas. You just
`import modin.pandas as pd` without changing the rest of your code:
[https://github.com/modin-project/modin](https://github.com/modin-project/modin)

~~~
marcinzm
It will, however, from my reading convert the data to an actual pandas
dataframe for certain operations. Which presumably is bad if your data doesn't
fit into memory.

------
opdahl
I'm currently running an index that requires over 64GB of ram. There's no
possibility of lowering this, but I don't have a really high requirement for
latency. So my solution was to get an M2 storage instance and create a 100GB
swap space. Works like a charm and in the future, if I need the higher latency
I just change the instance to something that has enough ram.

------
RivieraKid
Another one: reservoir sampling. It's used in Deep CFR, a form of CFR that
uses neural networks. CFR is an algorithm used for finding equilibrium
strategies in imperfect information games, most notably poker.

[https://en.wikipedia.org/wiki/Reservoir_sampling](https://en.wikipedia.org/wiki/Reservoir_sampling)
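The whole trick fits in a few lines (Algorithm R; the file name in the usage comment is a placeholder): one pass, O(k) memory, and every item ends up in the sample with equal probability.

    import random

    def reservoir_sample(stream, k, rng=random):
        sample = []
        for i, item in enumerate(stream):
            if i < k:
                sample.append(item)
            else:
                j = rng.randint(0, i)   # inclusive on both ends
                if j < k:
                    sample[j] = item
        return sample

    # e.g. sample 1,000 lines from a huge file in a single pass:
    # with open("huge.log") as f:
    #     lines = reservoir_sample(f, 1000)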

------
longemen3000
I'm closely following the development of an alternative to data.frame that
works on disk (disk.frame), using the same API.

------
cpa
A few years ago, I found it surprisingly difficult to sort a file that didn't
fit in RAM in Python; I couldn't find any simple libraries for this kind of stuff.

Disappointingly, I ended up writing my data to a text file, sorting with unix
sort (w/ some tuning on --parallel and --buffer-size) and reading it back.
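For reference, a rough standard-library version of what unix sort does under the hood (assuming newline-terminated text records; the chunk size is arbitrary): sort chunks that do fit in RAM, spill each to a temp file, then k-way merge them with heapq.

    import heapq
    import tempfile
    from itertools import islice

    def external_sort(src_path, dst_path, chunk_lines=1_000_000):
        runs = []
        with open(src_path) as src:
            while True:
                chunk = list(islice(src, chunk_lines))
                if not chunk:
                    break
                chunk.sort()
                run = tempfile.TemporaryFile(mode="w+")
                run.writelines(chunk)
                run.seek(0)
                runs.append(run)
        with open(dst_path, "w") as dst:
            dst.writelines(heapq.merge(*runs))   # streams, never loads all runs
        for run in runs:
            run.close()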

~~~
madhadron
> Disappointingly, I ended up writing my data to a text file, sorting with
> unix sort (w/ some tuning on --parallel and --buffer-size) and reading it
> back.

One of my biggest beefs with Unix is that so much powerful functionality is
locked up in command line utilities and not accessible as libraries.

------
emmelaich
There's an enormous amount of literature and technique in the mainframe world
for dealing with data bigger than ram. Because that was the norm.

Whole companies have been built around the problem. e.g. syncsort.com

I hope some interesting techniques don't get locked up and lost in proprietary
mainframe software.

------
gwbas1c
> When your data doesn’t fit in memory: the basic techniques

Uhm... Isn't it called a database?

Jokes aside, one of the core use cases of things like databases is random
access to a part of data that's too large to fit in RAM.

~~~
VikingCoder
Hi, once upon a time, I worked with medical image data. I literally had a DBA
suggest storing the volumetric data in a database. I worked diligently to
explain to him why he was incorrect. When he resisted my explanations and
tried to push forward with his plan, I worked diligently to get him fired. I
was successful.

~~~
Twisol
For those of us who don’t work with medical image data, can you share why he
was incorrect?

I realize that _relational_ databases are not the right box to fit certain
kinds of data into, but you have to put your data somewhere that allows it to
be efficiently manipulated. What is that if not a “data base”?

~~~
aspaceman
Medical imaging data is typically dense 3D grids of density samples. A
relational database could hold a pointer to the file, but not the 5GB-500GB
dataset files. One dataset is typically a folder of various files, metadata,
and other information representing an entire filesystem. Back in the day, the
datasets would be RAID across several machines. Now they're virtual
filesystems. The database could manage this info, but not hold the actual
dataset itself. I imagine this is what the commenter means. The DBA probably
said "it all needs to be in the database", and that just doesn't make sense.

~~~
xapata
Holds true for lots of data varieties; the database maintains links to
compressed archives.

------
RocketSyntax
step 1. use a generator instead of a loop
[https://realpython.com/introduction-to-python-
generators/](https://realpython.com/introduction-to-python-generators/)
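For instance (the CSV layout is made up), the generator version only ever holds one line in memory:

    def field_values(path, column=2):
        with open(path) as f:
            for line in f:
                yield float(line.split(",")[column])

    total = sum(field_values("big.csv"))   # consumed lazily, line by line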

------
novaRom
There is no simple solution because any compression/decompression costs you
computations. If data is stored uncompressed, it takes more space but access
takes less time than if the data were stored compressed.

------
rmuchev
If your data needs sorting and does not fit in memory:
[https://github.com/rmuchev/ExternalSort](https://github.com/rmuchev/ExternalSort)

------
__sy__
Just wanted to chime in to say that the Tensorflow 2.0 Dataset Library is nice
for this sort of problem. It supports batching, caching, remote fetching,
multiple cpu core splitting...etc.
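A minimal sketch of such a pipeline (TF 2.x; the file name is a placeholder): lines are read, batched and prefetched lazily, so the full file never has to fit in memory, and .cache(filename) can optionally persist parsed records to disk.

    import tensorflow as tf

    dataset = (
        tf.data.TextLineDataset("train.csv")
        .batch(256)
        .prefetch(tf.data.experimental.AUTOTUNE)
    )

    for batch in dataset:
        pass  # feed each batch to a training step / model.fit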

------
ak39
When your data doesn't fit in memory: SQLite

~~~
lytefm
Yeah for me this was also the obvious solution when I had to extract
information from ~80 GB of tabular data. It's not the most trendy tool, but it
did the job extremely well and with little effort. Just be sure to only create
an index after having inserted all the data...

~~~
ak39
Correct. Drop indexes for inserts and recreate after bulk inserts.

Also, another hint: process all inserts in the bulk in a single transaction. I
was amazed at the speed that SQLite ingests data! Speed demon! (Don’t forget
PRAGMA synchronous OFF too)
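A minimal sqlite3 sketch of those hints (the table, rows and batch source are made up): one transaction around the whole bulk load, with the index created only afterwards.

    import sqlite3

    con = sqlite3.connect("data.db")
    con.execute("PRAGMA synchronous = OFF")   # as noted above: faster, less crash-safe
    con.execute("CREATE TABLE IF NOT EXISTS events (ts INTEGER, payload TEXT)")

    def batches():
        # stand-in for streaming rows out of a huge source in manageable chunks
        yield [(i, "x") for i in range(100_000)]

    with con:                                  # one transaction for the whole load
        for batch in batches():
            con.executemany("INSERT INTO events VALUES (?, ?)", batch)

    con.execute("CREATE INDEX IF NOT EXISTS idx_events_ts ON events (ts)")
    con.close()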

------
beached_whale
mmap can really help with this too. It won't stop you from being silly and
using random memory access patterns, but it does abstract away the file access
as a piece of address space. Short of running on a 32-bit machine, this can
help a lot. Even on a 32-bit machine you can abstract it away with another
layer and window the access.

------
aledalgrande
Love these kind of problems... is chunking the same as streaming in this
context?

------
drderidder
Somebody needs to tell this guy about a thing called a "database".

------
snvzz
Odd not to see any reference to mmap, which is part of the python library.

------
threeseed
You can spin up a standalone Spark cluster in about an hour and it's pretty
easy.

And yes you will need to rewrite a small amount of your code but you're doing
so here as well.

At least you will be able to scale out to much larger volumes in a consistent
way.

~~~
zaphar
For many tasks a spark cluster is overkill. Streaming will get you a really
long way before you have to tackle running a while spark cluster. And you'll
probably be able to still do it on your laptop.

~~~
threeseed
You can't just apply a streaming paradigm to a batch style workload.

And 90% of all tasks are batch orientated.

~~~
zaphar
Batch is just streaming in really large chunks. And besides, if you are messing
around with data on your laptop you probably aren't at the stage of working
out a batch workload yet. Jumping straight to "this needs to be a batch
workload" before you've had a chance to play with it is probably overkill.

------
scottlocklin
^f mmap

yeah, that's what I thought.

------
cozzyd
Here's where ROOT really shines

------
anotherevan
The nice thing about virtual memory is you can make really big RAM disks…

~~~
gowld
Isn't that a regular disk, with your OS's (or SSHD's) file cache?

~~~
mbreese
No, in this context, it's creating a virtual disk using only RAM as a backing
store. If you do this, then you can use whatever tools you'd normally use
(cat, wc, grep, etc) to explore a dataset as if they were files. It's the
super quick and dirty way to go if you don't want to code up a proper solution
_and_ your data fits in RAM.

(assuming the data is organized in files to begin with)

------
eggie5
one word: Dask

------
nik_ma
Nice article

------
dschuetz
Did I miss something? Why would anyone open and _load_ a 100G file all at once
on a 16G machine? Why is there such a file in the first place?

