Ask HN: Sorting massive text files?
44 points by JBerlinsky 2353 days ago | 68 comments
I've got a ~35GB text file full of data, and I want to parse it so I only have unique results in the end file. In the past, I've had no problem with cat FILE | sort | uniq, but I've never worked with anything of this magnitude. Even running a line count takes an extraordinary amount of time:

    time cat FILE | wc -l
    2608560847

    real    11m18.148s
    user    1m35.667s
    sys     1m33.820s
    [root@server src]#

Any suggestions on how I can go about getting unique records from this type of file?

Unix sort already does a merge sort (search for it), which is your default solution if you just want to sort individual lines based on a field:


Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need to have enough extra disk space to hold all of the temporary files.)

If you need to do something more sophisticated (e.g. joining lines from a web server log into sessions, then sorting sessions), you can still use divide-and-conquer, but you have to be smarter. Divide the file into N parts based on some logic (i.e. divide into files based on session ID), then sort the lines in each individually, then merge the results back together.
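As a sketch of that divide step (the field number, log format, and filenames here are all hypothetical), awk can scatter lines into per-session files, after which each piece is sorted and the sorted pieces merged:

```shell
cd "$(mktemp -d)"                              # scratch dir for the demo
printf 'x 1 A\ny 2 B\nz 3 A\n' > access.log    # pretend field 3 is the session ID

# divide: one file per session
awk '{ f = "session." $3; print >> f; close(f) }' access.log

# conquer: sort each piece, then merge the sorted pieces
for f in session.*; do sort "$f" -o "$f"; done
sort -m session.* > sessions.sorted
```

The `close(f)` keeps awk from running out of file descriptors when there are many sessions.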

This is what map/reduce frameworks are made to do, of course, but something like Hadoop may be overkill unless you plan to do this type of thing often.

Merge sort works very well if you have multiple cores available. The general idea is to split the file, sort in parallel, and then merge.

Here's a blog post I wrote about it:


GNU sort has a --compress-program option that can help if you're really tight on scratch space, e.g. lzop(1). The -T option or the TMPDIR environment variable lets you point sort at a filesystem with more free space for its temporary files.

The -T option especially helps if you have more than one hard drive, and you're sorting a very large file where disk I/O is the bottleneck. You can use a second hard drive for the temporary files.
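A sketch combining the two (the scratch path stands in for a directory on a second drive; gzip stands in for lzop here only because it's more universally installed):

```shell
cd "$(mktemp -d)"
printf 'b\na\nb\n' > FILE          # tiny stand-in for the 35GB file
mkdir scratch                      # imagine this lives on a second drive

# temp files go to the scratch dir and get compressed on the way
sort -u --compress-program=gzip -T scratch FILE > FILE.uniq
cat FILE.uniq                      # a, then b
```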

This is the approach that I use except that I generate a Makefile which has:

  - a target to split the input into chunks
  - targets to sort each of the split chunks
  - a target to merge the sorted chunks using "sort -m"
Then, if you are using GNU make, it's just a matter of using the -j flag to control how many chunks are sorted in parallel.
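A minimal sketch of such a Makefile (filenames and chunk size hypothetical; run `make chunks` first so the wildcard sees the pieces, then `make -j4 result.txt`):

```make
CHUNKS := $(wildcard chunk.*)
SORTED := $(CHUNKS:chunk.%=sorted.%)

# stage 1: split the input on line boundaries
chunks: FILE
	split -d -C 1G FILE chunk.

# stage 2: sort each chunk (parallelized by make -j)
sorted.%: chunk.%
	sort -u $< -o $@

# stage 3: merge the sorted chunks
result.txt: $(SORTED)
	sort -m -u $(SORTED) -o $@
```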

0) Get your data into Amazon's cloud, and rent a server on EC2 with 68GB of memory. Spot instances are a buck an hour.

1) Do sort -u --buffer-size=60G FILE. You'll sort it all in memory, and that'll be a great speedup.

It's easier to scale up than scale out if you have the money, and your dataset fits in memory, so don't bother with Hadoop for something as simple as that. What do you want to do after you get the uniques?

but wouldn't that also include uploading 35GB of data and finally downloading xxGB (the result size)? i think when you add that time you can just take your multicore desktop and let it run overnight.

The canonical Hadoop example is counting the number of words in a large text file http://wiki.apache.org/hadoop/WordCount

You could use the same approach to simply take the unique tokens of the output of the final "count".

May be overkill if you can read the file in less than an hour, but this approach (divide and conquer) may be a good inspiration.

   sort -u <filename>
would be faster than all that piping

I've never used it, but there's a -S flag for sort that allows you to specify the buffer size. So maybe look into increasing that as far as you can.

-S is quite useful. For bigger sorting tasks, I usually use -S50% to speed up sorting. (sort's default memory size must be about 1MB.)

It may also help to use compression with the temporary files (something like "--compress-program lzop"), but I've never tried that.

Compared to the IO costs, the computational difference must be inconsequential.

Right, but sort is apparently clever enough to do a disk-backed mergesort on a file (according to timr's comment), but it doesn't get a chance to recognize that you're sorting a file if you just pipe in the data.

Why did you get modded up? You're wrong. Sort will write temporary files to /tmp if you exceed its in-memory sort threshold.

Did you really think sort(1) just bails if you pipe it more data than can be contained in memory?

You're correct. Perhaps the up-voters were thinking it must just be better to avoid the piping. Which it is, but that doesn't make the comment valid.

Er, you're right. I don't know what I was thinking.

I believe it uses scratch space in /tmp to do it if you pipe it in. (The manpage says it allows you to specify a temporary directory other than $TMPDIR, however.)

If you're going to process this file a lot, and it's straightforward text, then consider compressing it with lzop(1). lzop is fast at compressing, and very fast at decompressing, and the overhead of having to decompress on every read-through may well be much less than the time spent waiting for disk.

    lzop -9v $file
    lzop -dc $file.lzo | sort ...
You shouldn't use cat(1) if you don't need to. You're removing the ability of wc(1) to do faster things than read(2) from a pipe, e.g. mmap(2) the file, which is probably what cat is doing.

Similarly don't use uniq(1) if sort's -u option will suffice as you're forcing 35GB of data to be written down a pipe to uniq(1) when it's quite possible that `sort -u' would produce 100MB or so.

Personally, I would run:

  split -d -C 1G FILE split.FILE # separate into 1G files on line boundaries
  # then, on separate cores
  sort -u split.FILE.00 -o sort.FILE.00
  sort -u split.FILE.01 -o sort.FILE.01
  sort -u split.FILE.N -o sort.FILE.N
  # then, on one core
  sort -m -u sort.FILE.*
You should be aware that you may not get what you expect from sort unless you set LC_ALL="C". You should also pass "-S" to sort to set the main memory size, probably to a value like "1G". Of course, at some point when you're doing a lot of distributed processing you'll just want to use Hadoop, as other posters have noted.
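A minimal demonstration of those two flags together. LC_ALL=C makes sort compare raw bytes instead of doing locale-aware collation, which is much faster (and, as the byte ordering shows, can change the output relative to your locale):

```shell
cd "$(mktemp -d)"
printf 'b\nB\na\nb\n' > FILE
# C locale = plain byte comparison; -S sets the in-memory buffer size
LC_ALL=C sort -S 1G -u FILE -o FILE.uniq
cat FILE.uniq        # B, a, b  (ASCII order: uppercase first)
```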

It looks like GNU sort already does this internally, but with smarter temp file management. Increasing the memory size looks like a good idea, and if you have multiple hard drives, it may help to specify a temporary directory on a different hard drive with the -T option, since this problem looks IO-bound.

I believe GNU sort is a single-threaded external multi-pass/multi-way mergesort.


The reason to split is not to replicate mergesort, but rather to get around single threading in GNU sort. Of course, there are better options if this is a regular task, but GNU sort just happens to be really common and easy if it's a one-off.

I do wish I had a better split (since split seems to read the whole file) and a better sort (that was parallel and reasonably common) though.

Your example suggests that you are unhappy with the time it takes to read data from disk (the above is clearly I/O-bound). There are a zillion ways to get unique records from the file - the frequently-mentioned sort -u is one - but there's no way to do it without reading all of the file...

You also gain a "Useless use of cat" award for both of your examples: http://partmaps.org/era/unix/award.html

Both can be rewritten as "sort FILE | uniq" and "time wc -l FILE".

And given the data size, both non-cat uses would likely be faster as well.

We've made a multicore update to GNU sort recently, so with the latest from git one can just do: sort -u. That will chunk, compress, and merge automatically. Also run with LC_ALL=C if that's appropriate, as it will greatly speed things up. Also see the -S option.

You might check out bashreduce: http://github.com/erikfrey/bashreduce

If this is just a one off job, I would guess that running sort -u $FILE will take less time than writing some specific software for the purpose.

Of course, if you want to learn hadoop then go for it. But it's probably more practical just to let it run :-)

How important is it that you get every unique line? Is it ok for some small percentage of unique lines to not appear in the output?

How much memory do you have available?

The Bloom filter is a good way to go. Here is one possibility:

If you have 2 billion unique lines here's how much memory you need for a Bloom filter:

  Error rate        Memory
  ---------------   --------
  1 / thousand       3.35 GB
  1 / million        6.70 GB
  1 / billion       10.05 GB
  1 / 10 billion    11.16 GB
  1 / 100 billion   12.28 GB

If you have 12 GB available you can reduce the chance of missing a unique string down to almost zero.
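These figures match the standard Bloom filter sizing formula m = -n·ln(p)/ln(2)² bits (and line up with the table if its "GB" are read as GiB). A quick check for the one-in-a-million row:

```shell
# n = 2 billion lines, p = 1e-6 error rate
awk 'BEGIN { n = 2e9; p = 1e-6
             bits = -n * log(p) / log(2)^2
             printf "%.2f GiB\n", bits / 8 / 2^30 }'
# prints 6.70 GiB
```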

BTW, as you decrease your error rate, Cuckoo hashing starts to become an attractive alternative to Bloom filters:

http://en.wikipedia.org/wiki/Cuckoo_hashing http://www.ru.is/faculty/ulfar/CuckooHash.pdf

You can get your hash table utilization to almost 100%. If you're willing to accept a 64-bit hash function and you have 2 billion unique strings you'll need just over 16GB to do it all in one pass.

A 64-bit hash is just barely enough for 2 billion items. You might get a handful of collisions. Add more bits to reduce the probability. 128 bits would double the required size to 32GB, but you're unlikely to ever see a collision in your lifetime.

If 16GB or 32GB is too much you could do it in multiple passes. Make N passes through the text file and adjust the range of hash values you accept to test only 1/N each time.

8 passes through the original file with a generous 128-bit hash would take only 4GB of RAM.

I have the feeling this is more work than you would want to put into the problem, though.

Shove your data into postgresql or mysql with one of their bulk import utilities and let the db handle the hard work of indexing and filtering. That's what they are designed to do. You can then slice and dice the data to your heart's content. Make sure you have plenty of disk space, as indexing your data will take up a good amount of space, as will the overhead of storing it in the db in the first place.

That's a very roundabout and slow way to do it if he never uses the DB for anything else. Building those B-trees will be way more expensive than a merge sort.

I suppose, but I guess we don't really know the actual problem he is trying to solve. If he wants to do anything else with the data down the road then he is back to square one and has to start manipulating text files all over again.

I guess a line count has to run through the whole file, so obviously your storage is the bottleneck. Say your storage manages a sustained 100MB/s; the whole 35GB would then take about 6 minutes.
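The arithmetic checks out:

```shell
# 35 GB at a sustained 100 MB/s
awk 'BEGIN { printf "%.1f minutes\n", 35 * 1024 / 100 / 60 }'
# prints 6.0 minutes
```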

Your 'cat' is not needed by the way, sort takes a filename as argument. And sort can '-u'!

It would be interesting to know whether sort fails on this (and if so, why), or how long it takes.

edit: Why on earth are you doing this as root?

If you have access to K (or you can use the free J), this is exactly the type of problem they can solve. If you put the file into a list x, ?x is the only command you need to get a list of uniques.


Does "put[ting] the file into a list" mean "loading the file into memory"?

I'm pretty sure they mmap the file. J and K are both designed for doing an operation over massive collections. You'll have to read it off the disk one way or the other, though.

Do you know what percent of records you expect to be unique?

Also, what do you want to do with the unique records? That might affect what initial processing method is best for your goal.

Never did something like it, but can't resist thinking about it.

Maybe a kind of divide and conquer could work? Split into several files, do the sort | uniq on each of them. Then merge them, checking for duplicates on the way. I think merging should be almost as fast as line counting, at least linear in the size of the two files.

Edit: I guess it would be slower than counting, because presumably it would write to a new file (the merged file). But still, it should be linear.
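GNU sort already offers that merge-with-dedup step as a primitive: `sort -m -u` merges presorted inputs in linear time, dropping duplicates on the way (filenames hypothetical):

```shell
cd "$(mktemp -d)"
printf 'c\na\nb\n' > part1
printf 'b\nd\na\n' > part2
sort -u part1 -o part1.sorted
sort -u part2 -o part2.sorted
sort -m -u part1.sorted part2.sorted > merged   # a b c d, one per line
```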

I would use map/reduce to solve this problem. You can run a single-node pseudo-cluster in a few minutes by using Cloudera's package repositories (CDH3, https://wiki.cloudera.com/display/DOC/Hadoop+(CDH3)+Quick+St...). http://en.wikipedia.org/wiki/MapReduce

I recently did something similar and was surprised to see the python sort command be way faster than unix sort.

The only thing is that you have to split it up and merge after sorting (for which unix sort was ok enough).

Not sure why I got that result, but even with an increased buffer size for unix sort it didn't differ much. I also didn't run the split sorts in parallel, which would of course have been a good idea.

Python's sort algorithm is Timsort: http://en.wikipedia.org/wiki/Timsort

You may have had a data set that tickled something that plays to Timsort's advantage; Timsort was basically designed to encounter that case as often as possible on real data.

Get a few EC2 instances and use MapReduce (Hadoop)

you can just upload to S3 and use AWS Elastic Map Reduce, fire off 10 nodes, and have this done in minutes. No need to start EC2 instances

first i'd split it up into a bunch of smaller files, maybe 100 files that are each 350 MB?

then if you have multiple cores on your machine, you can run multiple instantiations of the same script on those smaller files and aggregate the results later.

if you're more ambitious, you could look into using a lightweight MapReduce framework

here's the way to do it incrementally, in a way that plays nice with having more text than you have ram, programmatically:

first split the file into chunks that will comfortably fit in ram, depending on the encoding of the file and the language's default encoding, allow maybe ~2x blow up in memory size, though experimenting is more accurate than guessing.

Then sort each of these files in place.

then have a program that opens all these smaller sorted files and does an incremental line-by-line merge across them, dropping one line whenever two are equal and writing the unique lines to the result file, and then tada!

i actually first dealt with this problem on a programming interview, and I quite liked how instructive an example of out of core programming it was.

That sounds an awful lot like merge sort.

Any suggestions on how I can go about getting unique records from this type of file?

If that's what you're after, you don't need to sort at all, you can pick out your uniques with a 3 line script in your favourite language, split up the file first, if necessary.
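The classic version of that short script is an awk one-liner: keep a hash of the lines seen so far and print a line only on first sight (memory grows with the number of unique lines, hence the suggestion to split first for a file this size):

```shell
cd "$(mktemp -d)"
printf 'b\na\nb\nc\na\n' > FILE
# print each line the first time it appears; input order preserved
awk '!seen[$0]++' FILE > FILE.uniq   # b, a, c
```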

The file has 2+ billion lines. A set or hash table won't work very well if there are, say, 500 million unique strings.

Hence the split. It's still likely to beat the crap out of sorting and it might not be needed, depending on the data. I do wonder what the dataset is that makes for such tiny, short lines.

Given the numbers, the need to dedupe, the poster's strong motivation and relative inexperience...

I'd guess it's a gigantic spam list.

Hah, self-duh. Total failure of technical cynicism on my part.

an in-memory set won't work. indexes don't have to be in memory all the time to provide log N lookup, I mean just think about how b-trees work.

...and this is a problem most databases have addressed.

It might well work; it depends a great deal on the data. In any event, I was talking about in-memory hashing, so the responder's assumption about what I meant was correct.

Did it take 11 minutes to do a line count? This wasn't across a network, was it? If so, then do the work local to where the data resides. I have had success munging textual data by making sure it stays compressed (lzop & pigz are good).

Bloom filter + memmapped hash table is generally a pretty good way to go.


a bloom filter may give you false positives (i.e., you think you've seen it before, but you didn't). Or do you mean, use a bloom filter (~2-3MB, so that it fits in the L3 cache) and look in the mmapped hash table only if the bloom filter indicates that you may have seen the line already?

That sounds at least halfway feasible to me - you assume 1/3 duplicates, and 20 bytes per line, you'd get ~8GB worth of hashtable entries (i.e., if you want the hashtable 70% filled to limit the amount of collisions, you'd need 12GB of virtual memory to back the hashtable, but only rarely access it since you're using the Bloom filter).

(To the person who downvoted it: can you say why you don't like the idea?)

Sort, with this map/reduce done in Bash: http://news.ycombinator.com/item?id=1604780

You do realize that the 11 minute time is just about what you would expect to read the file off a single drive?

maybe you have to partition the file, but I'm guessing random access will let you work with it fine; I haven't ever poked at a 35GB text file either.

first thing I'd try is probably pseudo-this:


       sha1e = sha1(entry);

       // database has a log N index on sha1e
       if (database.contains(sha1e)) continue;

       // database object figures out the insert syntax
       database.insert(sha1e);
       output(entry);



Have you looked at indexing the data using something like lucene?

cat FILE | awk -F(dataseperator) "{print $(number of seperation to print)}" | sort -g | uniq | sort -g

And you also gain a useless use of cat award: http://partmaps.org/era/unix/award.html

The above can be written: awk -F(dataseperator) "{print $(number of seperation to print)}" FILE | sort -g | uniq | sort -g

What do you and the original poster hope to achieve by

    ...| sort -g | uniq | sort -g
The second sort is doing nothing. And sort can do uniq, saving time and I/O.

    $ (seq 10; seq 5) | shuf | sort -gu | fmt
    1 2 3 4 5 6 7 8 9 10

The first thing I'd recommend doing is find the unique entries THEN sort. Unless uniq has vastly better performance on sorted files...

Finding duplicates in unsorted data is pretty time consuming.

You should see the people at my company do it by hand with pieces of paper! With n approaching 20,000 sometimes. (Yes, we're working on automating this even as we speak.)

Would you mind giving more details on this? I don't understand how it would be possible for a human to do this with n > 1kish, and even then I would imagine it being horribly slow.

I'm pretty sure actually that they don't do it when it gets to 20,000 docs (and thus we pay a bit extra for postage/materials than we might otherwise have to), exactly for the same reason you think so. But that's the adamant claim of those from whom requirements are gathered. It'll take me 5 minutes to implement the filter function to make this happen -- far less time than it'd take for me to sit down with them and have them prove that they actually do this operation. So I haven't pressed the issue.

I know for a fact that they do do it on smaller batches, though. It takes a lot of room, as you can imagine!

uniq requires a sorted file (it only collapses adjacent duplicate lines)
