
Ask HN: Sorting massive text files? - JBerlinsky
I've got a ~35GB text file full of data, and I want to parse it so I only have unique results in the end file. In the past, I've had no problem with cat FILE | sort | uniq, but I've never worked with anything of this magnitude. Even running a line count takes an extraordinary amount of time:

    
    
        time cat FILE | wc -l
        2608560847
        
        real    11m18.148s
        user    1m35.667s
        sys     1m33.820s
        [root@server src]#
    

Any suggestions on how I can go about getting unique records from this type of file?
======
timr
Unix sort already does a merge sort (search for it), which is your default
solution if you just want to sort individual lines based on a field:

<http://vkundeti.blogspot.com/2008/03/tech-algorithmic-details-of-unix-sort.html>

Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need
to have enough extra disk space to hold all of the temporary files.)

If you need to do something more sophisticated (e.g. joining lines from a web
server log into sessions, then sorting sessions), you can still use divide-
and-conquer, but you have to be smarter. Divide the file into N parts based on
some logic (e.g. into files based on session ID), then sort the lines in each
individually, then merge the results back together.
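
A minimal sketch of that partition step in Python (the bucket count and the
assumption that the session ID is the first field are made up for
illustration):

    
    
        import zlib
    
        # route each line into one of N bucket files by session ID, so every
        # bucket can then be sorted and deduped independently
        N = 16
        outs = [open(f'part.{i}', 'wb') for i in range(N)]
        with open('FILE', 'rb') as f:
            for line in f:
                session_id = line.split(b' ', 1)[0]   # assumed: first field
                outs[zlib.crc32(session_id) % N].write(line)
        for o in outs:
            o.close()
    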

This is what map/reduce frameworks are made to do, of course, but something
like Hadoop may be overkill unless you plan to do this type of thing often.

~~~
ralph
GNU sort has a --compress-program option that can help if you're really tight
on scratch space, e.g. lzop(1). The -T option or the TMPDIR environment
variable lets you point the temporary files at a filesystem with more free
space.

~~~
pjscott
The -T option especially helps if you have more than one hard drive, and
you're sorting a very large file where disk I/O is the bottleneck. You can use
a second hard drive for the temporary files.

------
lsb
0) Get your data into Amazon's cloud, and rent a server on EC2 with 68GB of
memory. Spot instances are a buck an hour.

1) Do sort -u --buffer-size=60GB FILE. You'll sort it all in memory, and
that'll be a great speedup.

It's easier to scale up than scale out if you have the money, and your dataset
fits in memory, so don't bother with Hadoop for something as simple as that.
What do you want to do after you get the uniques?

~~~
trin_
But wouldn't that also include uploading 35GB of data and finally downloading
xxGB (the result size)? I think when you add in that time, you can just take
your multicore desktop and let it run overnight.

------
brown9-2
The canonical Hadoop example is counting the occurrences of each word in a
large text file: <http://wiki.apache.org/hadoop/WordCount>

You could use the same approach and simply take the unique tokens from the
output of the final "count".

It may be overkill if you can read the file in less than an hour, but this
approach (divide and conquer) may be a good inspiration.
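
A toy, single-process rendition of that map/shuffle/reduce shape in Python
(sorting stands in for the shuffle, and the reduce collapses each group of
identical lines to one record):

    
    
        import itertools, sys
    
        # "map" each line to itself, "shuffle" by sorting, then "reduce"
        # each run of identical lines down to a single output record
        mapped = sorted(sys.stdin.buffer)
        for line, _group in itertools.groupby(mapped):
            sys.stdout.buffer.write(line)
    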

------
naz

       sort -u <filename>

would be faster than all that piping

~~~
secretasiandan
I've never used it, but there's a -S flag for sort that allows you to specify
the buffer size. So maybe look into setting that as large as you can.

~~~
sqrt17
-S is quite useful. For bigger sorting tasks, I usually use -S50% to speed
things up (sort's default buffer size seems to be about 1MB).

It may also help to use compression with the temporary files (something like "
--compress-program lzop"), but I've never tried that.

------
ralph
If you're going to process this file a lot, and it's straightforward text,
then consider compressing it with lzop(1). lzop is fast at compressing, and
very fast at decompressing, and the overhead of having to decompress on every
read through may well be much less than the time saved waiting on the disk.

    
    
        lzop -9v $file
        lzop -dc $file.lzo | sort ...
    

You shouldn't use cat(1) if you don't need to. You're removing the ability of
wc(1) to do faster things than read(2) from a pipe, e.g. mmap(2) the file,
instead of the plain read/write loop that cat is probably doing.

Similarly, don't use uniq(1) if sort's -u option will suffice, as you're
forcing 35GB of data to be written down a pipe to uniq(1) when it's quite
possible that `sort -u' would produce 100MB or so.

------
pbh
Personally, I would run:

    
    
      split -d -C 1G FILE split.FILE # separate into 1G files on line boundaries
      # then, on separate cores
      sort -u split.FILE.00 -o sort.FILE.00
      sort -u split.FILE.01 -o sort.FILE.01
      ...
      sort -u split.FILE.N -o sort.FILE.N
      # then, on one core
      sort -m -u sort.FILE.*
    

You should be aware that you may not get what you expect from sort unless you
set LC_ALL="C". You should also pass "-S" to sort to set the main memory size,
probably to a value like "1G". Of course, at some point when you're doing a
lot of distributed processing you'll just want to use Hadoop, as other posters
have noted.

~~~
pjscott
It looks like GNU sort already does this internally, but with smarter temp
file management. Increasing the memory size looks like a good idea, and if you
have multiple hard drives, it may help to specify a temporary directory on a
different hard drive with the -T option, since this problem looks IO-bound.

~~~
pbh
I believe GNU sort is a single-threaded external multi-pass/multi-way
mergesort.

<http://en.wikipedia.org/wiki/External_sorting>

The reason to split is not to replicate mergesort, but rather to get around
single threading in GNU sort. Of course, there are better options if this is a
regular task, but GNU sort just happens to be really common and easy if it's a
one-off.

I do wish I had a better split (since split seems to read the whole file) and
a better sort (that was parallel and reasonably common) though.

------
JoachimSchipper
Your example suggests that you are unhappy with the time it takes to read data
from disk (the above is _clearly_ I/O-bound). There are a zillion ways to get
unique records from the file - the frequently-mentioned sort -u is one - but
there's no way to do it without reading all of the file...

------
uuoc
You also gain a "Useless use of cat" award for both of your examples:
<http://partmaps.org/era/unix/award.html>

Both can be rewritten as "sort FILE | uniq" and "time wc -l FILE".

And given the data size, both non-cat uses would likely be faster as well.

------
pixelbeat
We've made a multicore update to GNU sort recently, so with the latest from
git one can just do: sort -u. That will chunk, compress, and merge
automatically. Also run with LC_ALL=C if that's appropriate, as it will
greatly speed things up. Also see the -S option.

------
wooby
You might check out bashreduce: <http://github.com/erikfrey/bashreduce>

------
nagrom
If this is just a one off job, I would guess that running sort -u $FILE will
take less time than writing some specific software for the purpose.

Of course, if you want to learn hadoop then go for it. But it's probably more
practical just to let it run :-)

------
davidst
How important is it that you get every unique line? Is it ok for some small
percentage of unique lines to not appear in the output?

How much memory do you have available?

A Bloom filter is a good way to go. Here is one possibility:

If you have 2 billion unique lines, here's how much memory you need for a
Bloom filter:

    
    
        Error rate       Memory
        ---------------  --------
        1 / thousand      3.35 GB
        1 / million       6.70 GB
        1 / billion      10.05 GB
        1 / 10 billion   11.16 GB
        1 / 100 billion  12.28 GB
    

If you have 12 GB available you can reduce the chance of missing a unique
string down to almost zero.
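
For the curious, the table's numbers match the standard formula for an
optimally-sized Bloom filter, m = -n ln(p) / (ln 2)^2 bits for n items at
false-positive rate p. A quick check in Python:

    
    
        import math
    
        # memory for an optimally-sized Bloom filter holding n items
        # at false-positive rate p
        n = 2_000_000_000
        for p in (1e-3, 1e-6, 1e-9, 1e-10, 1e-11):
            bits = -n * math.log(p) / math.log(2) ** 2
            print(f'1/{1/p:g}: {bits / 8 / 2**30:.2f} GB')
    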

~~~
davidst
BTW, as you desire to decrease your error rate, cuckoo hashing starts to
become an attractive alternative to Bloom filters:

<http://en.wikipedia.org/wiki/Cuckoo_hashing>
<http://www.ru.is/faculty/ulfar/CuckooHash.pdf>

You can get your hash table utilization to almost 100%. If you're willing to
accept a 64-bit hash function and you have 2 billion unique strings you'll
need just over 16GB to do it all in one pass.

A 64-bit hash is just barely enough for 2 billion items. You might get a
handful of collisions. Add more bits to reduce the probability. 128 bits would
double the required size to 32GB, but you're unlikely to ever see a collision
in your lifetime.
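
A toy sketch of the cuckoo scheme, assuming two tables with one hash function
each and a bounded eviction chain (the sizes and hash choice here are
illustrative, not a tuned layout):

    
    
        import hashlib
    
        SIZE = 1 << 20                  # illustrative table size
        tables = [[None] * SIZE, [None] * SIZE]
    
        def h(item, t):                 # one 64-bit hash per table, via salt
            d = hashlib.blake2b(item, digest_size=8, salt=bytes([t])).digest()
            return int.from_bytes(d, 'big') % SIZE
    
        def contains(item):
            return any(tables[t][h(item, t)] == item for t in (0, 1))
    
        def insert(item):
            t = 0
            for _ in range(64):         # bounded eviction chain
                i = h(item, t)
                if tables[t][i] in (None, item):
                    tables[t][i] = item
                    return True
                tables[t][i], item = item, tables[t][i]  # kick occupant out
                t ^= 1                  # re-place the evicted item next round
            return False                # table too full; a real one would rehash
    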

If 16GB or 32GB is too much you could do it in multiple passes. Make N passes
through the text file and adjust the range of hash values you accept to test
only 1/N each time.

8 passes through the original file with a generous 128-bit hash would take
only 4GB of RAM.
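
A minimal sketch of that N-pass variant (the file name and pass count are
placeholders, and a Python set stands in for the compact hash table the
numbers above assume):

    
    
        import hashlib, sys
    
        N = 8                           # passes; RAM holds ~1/N of the uniques
    
        def h64(line):
            return int.from_bytes(
                hashlib.blake2b(line, digest_size=8).digest(), 'big')
    
        for p in range(N):
            seen = set()
            with open('FILE', 'rb') as f:
                for line in f:
                    hv = h64(line)
                    # pass p only considers hashes in slice p of the hash space
                    if hv % N == p and hv not in seen:
                        seen.add(hv)
                        sys.stdout.buffer.write(line)
    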

I have the feeling this is more work than you would want to put into the
problem, though.

------
andymoe
Shove your data into PostgreSQL or MySQL with one of their bulk import
utilities and let the db handle the hard work of indexing and filtering.
That's what they are designed to do. You can then slice and dice the data to
your heart's content. Make sure you have plenty of disk space, as indexing
your data will take up a good amount of space, as will the overhead of
storing it in the db in the first place.

~~~
jey
That's a very roundabout and slow way to do it if he never uses the DB for
anything else. Building those B-trees will be way more expensive than a merge
sort.

~~~
andymoe
I suppose, but I guess we don't really know the actual problem he is trying to
solve. If he wants to do anything else with the data down the road then he is
back to square one and has to start manipulating text files all over again.

------
aw3c2
I guess a line count has to run through the whole file, so obviously your
storage is the bottleneck. Say your storage manages a constant 100MB/s: the
whole 35GB would take about 6 minutes.

Your 'cat' is not needed by the way, sort takes a filename as argument. And
sort can '-u'!

It would be interesting to know whether sort fails on this (and why), or how
long it takes.

edit: Why on earth are you doing this as root?

------
meterplech
If you have access to K (or you can use the free J), this is exactly the type
of problem they can solve. If you put the file into a list x, ?x is the only
command you need to get a list of uniques.

<http://en.wikipedia.org/wiki/K_(programming_language)>

~~~
Groxx
Does "put[ting] the file into a list" mean "loading the file into memory"?

~~~
silentbicycle
I'm pretty sure they mmap the file. J and K are both designed for doing an
operation over massive collections. You'll have to read it off the disk one
way or the other, though.

------
secretasiandan
Do you know what percent of records you expect to be unique?

Also, what do you want to do with the unique records? That might affect which
initial processing method is best for your goal.

------
Tichy
Never did something like it, but can't resist thinking about it.

Maybe a kind of divide and conquer could work? Split into several files, do
the sort | uniq on each of them. Then merge them, checking for duplicates on
the way. I think merging should be almost as fast as line counting, at least
linear in the size of the two files.

Edit: I guess it would be slower than counting, because presumably it would
write to a new file (the merged file). But still, it should be linear.

------
rwaliany
I would use map/reduce to solve this problem. You can run a single-node
pseudo-cluster in a few minutes by using Cloudera's package repositories
(CDH3, <https://wiki.cloudera.com/display/DOC/Hadoop+(CDH3)+Quick+Start+Guide>).
<http://en.wikipedia.org/wiki/MapReduce>

------
xsltuser2010
I recently did something similar and was surprised to see Python's sort be
way faster than Unix sort.

The only thing is that you have to split the file up and merge after sorting
(for which Unix sort was good enough).

Not sure why I got that result, but even with an increased buffer size for
Unix sort it didn't differ much. I also didn't run the split sorts in
parallel, which would of course have been a good idea.

~~~
jerf
Python's sort algorithm is Timsort: <http://en.wikipedia.org/wiki/Timsort>

You may have had a data set that tickled something that plays to Timsort's
advantage; Timsort was basically designed to encounter that case as often as
possible on real data.

------
Methos
Get a few EC2 instances and use MapReduce (Hadoop).

~~~
earle
You can just upload to S3 and use AWS Elastic MapReduce, fire off 10 nodes,
and have this done in minutes. No need to start EC2 instances yourself.

------
pgbovine
First I'd split it up into a bunch of smaller files, maybe 100 files that are
each 350MB?

Then, if you have multiple cores on your machine, you can run multiple
instantiations of the same script on those smaller files and aggregate the
results later.

If you're more ambitious, you could look into using a lightweight MapReduce
framework.

------
carterschonwald
Here's the way to do it incrementally, in a way that plays nice with having
more text than you have RAM, programmatically:

First split the file into chunks that will comfortably fit in RAM. Depending
on the encoding of the file and the language's default encoding, allow for
maybe ~2x blow-up in memory size, though experimenting is more accurate than
guessing.

Then sort each of these files in place.

Then have a program open all these smaller sorted files and do an
incremental, line-by-line merge across all of them, dropping one line
whenever two are equal and writing the unique lines to the result file. Tada!
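
A minimal Python sketch of that final merge step, assuming chunk files
chunk.0 through chunk.7 (hypothetical names) were already sorted byte-wise by
the earlier pass:

    
    
        import heapq
    
        chunks = [open(f'chunk.{i}', 'rb') for i in range(8)]
        with open('unique.txt', 'wb') as out:
            prev = None
            for line in heapq.merge(*chunks):   # incremental k-way merge
                if line != prev:                # equal lines: keep only one
                    out.write(line)
                    prev = line
        for c in chunks:
            c.close()
    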

I actually first dealt with this problem in a programming interview, and I
quite liked how instructive an example of out-of-core programming it was.

~~~
nailer
That sounds an awful lot like merge sort.

------
pvg
_Any suggestions on how I can go about getting unique records from this type
of file?_

If that's what you're after, you don't need to sort at all; you can pick out
your uniques with a 3-line script in your favourite language. Split up the
file first, if necessary.
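
Something like this, presumably (only viable when the uniques fit in RAM,
which is exactly what the replies below poke at):

    
    
        import sys
    
        seen = set()
        for line in sys.stdin.buffer:
            if line not in seen:        # first sighting: keep it
                seen.add(line)
                sys.stdout.buffer.write(line)
    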

~~~
aristus
The file has 2+ billion lines. A set or hash table won't work very well if
there are, say, 500 million unique strings.

~~~
pvg
Hence the split. It's still likely to beat the crap out of sorting and it
might not be needed, depending on the data. I do wonder what the dataset is
that makes for such tiny, short lines.

~~~
aristus
Given the numbers, the need to dedupe, the poster's strong motivation and
relative inexperience...

I'd guess it's a gigantic spam list.

~~~
pvg
Hah, self-duh. Total failure of technical cynicism on my part.

------
b0b0b0b
Did it take 11 minutes to do a line count? This wasn't across a network, was
it? If so, then do the work local to where the data resides. I have had
success munging textual data by making sure it stays compressed (lzop & pigz
are good).

------
alexgartrell
Bloom filter + memmapped hash table is generally a pretty good way to go.

<http://en.wikipedia.org/wiki/Bloom_filter>

~~~
sqrt17
A Bloom filter may give you false positives (i.e., you think you've seen it
before, but you didn't). Or do you mean: use a Bloom filter (~2-3MB, so that
it fits in the L3 cache) and look in the mmapped hash table only if the Bloom
filter indicates that you may have seen the line already?

That sounds at least halfway feasible to me - if you assume 1/3 duplicates
and 20 bytes per line, you'd get ~8GB worth of hashtable entries (i.e., if
you want the hashtable 70% filled to limit the number of collisions, you'd
need 12GB of virtual memory to back it, but you'd only rarely access it since
you're using the Bloom filter).
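
A minimal sketch of that two-level idea (a plain Python set stands in for the
big mmapped table, and the filter parameters are made up):

    
    
        import hashlib, sys
    
        M = 8 * 2 * 1024 * 1024         # 2MB of bits, to stay cache-resident
        bits = bytearray(M // 8)
        big_table = set()               # stand-in for the mmapped hash table
    
        def positions(line, k=4):       # k bit positions from one digest
            d = hashlib.blake2b(line, digest_size=8 * k).digest()
            return [int.from_bytes(d[8*i:8*i+8], 'big') % M for i in range(k)]
    
        for line in sys.stdin.buffer:
            pos = positions(line)
            if all(bits[p >> 3] & (1 << (p & 7)) for p in pos):
                if line in big_table:   # filter said "maybe"; table settles it
                    continue
            for p in pos:
                bits[p >> 3] |= 1 << (p & 7)
            big_table.add(line)
            sys.stdout.buffer.write(line)
    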

(To the person who downvoted it: can you say why you don't like the idea?)

------
bravo_sierra
Sort, with this map/reduce done in Bash:
<http://news.ycombinator.com/item?id=1604780>

------
macros
You do realize that the 11 minute time is just about what you would expect to
read the file off a single drive?

------
keefe
Maybe you have to partition the file, but I'm guessing random access will let
you work with it fine; I haven't ever poked at a 35GB text file either.

First thing I'd try is probably something like this (sketched here with
sqlite standing in for the database):

    
    
        import hashlib, sqlite3
    
        db = sqlite3.connect('seen.db')
        # the primary key gives the database a log-N index on the sha1
        db.execute('CREATE TABLE IF NOT EXISTS seen'
                   ' (sha1 BLOB PRIMARY KEY, data BLOB)')
    
        with open('FILE', 'rb') as f:
            for entry in f:
                sha1e = hashlib.sha1(entry).digest()
                # the unique key makes the database skip duplicates
                db.execute('INSERT OR IGNORE INTO seen VALUES (?, ?)',
                           (sha1e, entry))
        db.commit()
    

------
sordina
Have you looked at indexing the data using something like Lucene?

------
a904guy
cat FILE | awk -F(separator) "{print $(number of the field to print)}" |
sort -g | uniq | sort -g

~~~
uuoc
And you also gain a useless use of cat award:
<http://partmaps.org/era/unix/award.html>

The above can be written: awk -F(separator) "{print $(number of the field to
print)}" FILE | sort -g | uniq | sort -g

~~~
ralph
What do you and the original poster hope to achieve by

    
    
        ...| sort -g | uniq | sort -g
    

The second sort is doing nothing. And sort can do uniq, saving time and I/O.

    
    
        $ (seq 10; seq 5) | shuf | sort -gu | fmt
        1 2 3 4 5 6 7 8 9 10
        $

------
agentq
The first thing I'd recommend is finding the unique entries, THEN sorting.
Unless uniq has vastly better performance on sorted files...

~~~
sausagefeet
Finding duplicates in unsorted data is pretty time consuming.

~~~
glhaynes
You should see the people at my company do it by hand with pieces of paper!
With _n_ approaching 20,000 sometimes. (Yes, we're working on automating this
even as we speak.)

~~~
peripitea
Would you mind giving more details on this? I don't understand how it would be
possible for a human to do this with n > 1kish, and even then I would imagine
it being horribly slow.

~~~
glhaynes
I'm pretty sure actually that they _don't_ do it when it gets to 20,000 docs
(and thus we pay a bit extra for postage/materials than we might otherwise
have to), exactly for the same reason you think so. But that's the adamant
claim of those from whom requirements are gathered. It'll take me 5 minutes to
implement the filter function to make this happen -- far less time than it'd
take for me to sit down with them and have them prove that they actually do
this operation. So I haven't pressed the issue.

I know for a fact that they _do_ do it on smaller batches, though. It takes a
lot of room, as you can imagine!

