I've got a ~35GB text file full of data, and I want to parse it so I only have unique results in the end file. In the past, I've had no problem with cat FILE | sort | uniq, but I've never worked with anything of this magnitude. Even running a line count takes an extraordinary amount of time:
time cat FILE | wc -l
Any suggestions on how I can go about getting unique records from this type of file?
Thus, sort -u <filename> is your go-to for simple jobs. (Note that you'll need to have enough extra disk space to hold all of the temporary files.)
If you need to do something more sophisticated (e.g. joining lines from a web server log into sessions, then sorting sessions), you can still use divide-and-conquer, but you have to be smarter. Divide the file into N parts based on some logic (i.e. divide into files based on session ID), then sort the lines in each individually, then merge the results back together.
This is what map/reduce frameworks are made to do, of course, but something like Hadoop may be overkill unless you plan to do this type of thing often.
GNU sort has a --compress-program option that can help if you're really tight on scratch space, e.g. lzop(1). The -T option or TMPDIR environment variable lets you state a filesystem with more free space can be used instead.
Right, but sort is apparently clever enough to do a disk-backed mergesort on a file (according to timr's comment), but it doesn't get a chance to recognize that you're sorting a file if you just pipe in the data.
0) Get your data into Amazon's cloud, and rent a server on EC2 with 68GB of memory. Spot instances are a buck an hour.
1) Do sort -u --buffer-size=60GB FILE. You'll sort it all in memory, and that'll be a great speedup.
It's easier to scale up than scale out if you have the money, and your dataset fits in memory, so don't bother with Hadoop for something as simple as that. What do you want to do after you get the uniques?
split -d -C 1G FILE split.FILE # separate into 1G files on line boundaries
# then, on separate cores
sort -u split.FILE.00 -o sort.FILE.00
sort -u split.FILE.01 -o sort.FILE.01
sort -u split.FILE.N -o sort.FILE.N
# then, on one core
sort -m -u sort.FILE.*
You should be aware that you may not get what you expect from sort unless you set LC_ALL="C". You should also pass "-S" to sort to set the main memory size, probably to a value like "1G". Of course, at some point when you're doing a lot of distributed processing you'll just want to use Hadoop, as other posters have noted.
It looks like GNU sort already does this internally, but with smarter temp file management. Increasing the memory size looks like a good idea, and if you have multiple hard drives, it may help to specify a temporary directory on a different hard drive with the -T option, since this problem looks IO-bound.
The reason to split is not to replicate mergesort, but rather to get around single threading in GNU sort. Of course, there are better options if this is a regular task, but GNU sort just happens to be really common and easy if it's a one-off.
I do wish I had a better split (since split seems to read the whole file) and a better sort (that was parallel and reasonably common) though.
Your example suggests that you are unhappy with the time it takes to read data from disk (the above is clearly I/O-bound). There are a zillion ways to get unique records from the file - the frequently-mentioned sort -u is one - but there's no way to do it without reading all of the file...
We've made a multicore update to GNU sort recently, so with latest from git one can just do: sort -u
That will chunk, compress, merge automatically
Also run with LC_ALL=C if that's appropriate as it will greatly speedup. Also see the -S
If you're going to process this file a lot, and it's straightforward text, then consider compressing it with lzop(1). lzop's fast as compressing, and very fast at decompressing, and the overhead of having to decompress on every read through may well be much less than the time saved by waiting for disk.
lzop -9v $file
lzop -dc $file.lzo | sort ...
You shouldn't use cat(1) if you don't need to. You're removing the ability of wc(1) to do faster things than read(2) from a pipe, e.g. mmap(2) the file, which is probably what cat is doing.
Similarly don't use uniq(1) if sort's -u option will suffice as you're forcing 35GB of data to be written down a pipe to uniq(1) when it's quite possible that `sort -u' would produce 100MB or so.
You can get your hash table utilization to almost 100%. If you're willing to accept a 64-bit hash function and you have 2 billion unique strings you'll need just over 16GB to do it all in one pass.
A 64-bit hash is is just barely enough for 2 billion items. You might get a handful of collisions. Add more bits to reduce the probability. 128-bits would double the required size to 32GB but you're unlikely to ever see a collision in your life time.
If 16GB or 32GB is too much you could do it in multiple passes. Make N passes through the text file and adjust the range of hash values you accept to test only 1/N each time.
8 passes through the original file with a generous 128-bit hash would take only 4GB of RAM.
I have the feeling this is more work than you would want to put into the problem, though.
a bloom filter may give you false negatives (i.e., you think you've seen it before, but you didn't).
Or do you mean, use a bloom filter (~2-3MB, so that it fits in the L3 cache) and look in the mmapped hash table only if the bloom filter indicates that you may have seen the line already?
That sounds at least halfway feasible to me - you assume 1/3 duplicates, and 20 bytes per line, you'd get ~8GB worth of hashtable entries (i.e., if you want the hashtable 70% filled to limit the amount of collisions, you'd need 12GB of virtual memory to back the hashtable, but only rarely access it since you're using the Bloom filter).
(To the person who downvoted it: can you say why you don't like the idea?)
Shove your data into postgresql or mysql with one of their bulk import utilities and let the db handle the hard work of indexing and filtering. That's what they are designed to do. You can then slice and dice the data to your hearts content. Make sure you have plenty of disk space as indexing you data will take up a good amount of space as will the overhead of storing it in the db in the first place.
I suppose, but I guess we don't really know the actual problem he is trying to solve. If he wants to do anything else with the data down the road then he is back to square one and has to start manipulating text files all over again.
heres the way to do it incrementally in a way that plays nice with having more text than you have ram, programmatically:
first split the file into chunks that will comfortably fit in ram, depending on the encoding of the file and the language's default encoding, allow maybe ~2x blow up in memory size, though experimenting is more accurate than guessing.
Then sort each of these files in place.
then have a program then opens all these smaller sorted files, and does an incremental line by line merge that compares them over all these files, with the case of two lines being equal to drop one, and write the lines that are unique to the result file, and then tada!
i actually first dealt with this problem on a programming interview, and I quite liked how instructive an example of out of core programming it was.
Never did something like it, but can't resist thinking about it.
Maybe a kind of divide and conquer could work? Split into several files, do the sort | uniq on each of them. Then merge them, checking for duplicates on the way. I think merging should be almost as fast as line counting, at least linear in the size of the two files.
Edit: I guess it would be slower than counting, because presumably it would write to a new file (the merged file). But still, it should be linear.
Did it take 11 minutes to do a line count? This wasn't across a network, was it? If so, then do the work local to where the data resides. I have had success munging textual data by making sure it stays compressed (lzop & pigz are good).
I'm pretty sure actually that they don't do it when it gets to 20,000 docs (and thus we pay a bit extra for postage/materials than we might otherwise have to), exactly for the same reason you think so. But that's the adamant claim of those from whom requirements are gathered. It'll take me 5 minutes to implement the filter function to make this happen -- far less time than it'd take for me to sit down with them and have them prove that they actually do this operation. So I haven't pressed the issue.
I know for a fact that they do do it on smaller batches, though. It takes a lot of room, as you can imagine!