

Ask HN: How to read a terabyte flat file - calgaryo

For some reason, an application spews its results into a single terabyte-scale flat file. What do you think would be the best way of reading (multiple sections) from the file?
======
gdp
I don't think I understand the question. Are you actually asking how to read
from a terabyte flat file, or are you asking how to process a terabyte of
sequential data? They are related but distinct questions.

~~~
tamersalama
Sorry for not being clear the first time. Proper processing is what I'm after.
The reading has to be accompanied by parsing. The file is divided into
sections, and each section will have its own parsing / user-actions /
processing / output.

~~~
gdp
Well, the main thing would be to operate in a store-nothing kind of way: the
memory required has to stay fixed no matter how long the input is. You'll need
to keep aggregate totals and compute over each row, as opposed to storing any
data.

The parsing should be simple. Lex and/or Yacc should handle very large files.
If not, you can always write something by hand, once again sticking to the
store-nothing principle.
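A rough Python sketch of the store-nothing idea (the file name and field layout here are made up): iterate the file line by line, keep only running aggregates, and never hold rows in memory.

```python
def aggregate(lines):
    """Constant-memory pass: keep running totals, never store the rows.
    Assumes (hypothetically) the first whitespace field is numeric."""
    total = 0.0
    count = 0
    for line in lines:          # iterating a file object streams lazily
        total += float(line.split()[0])
        count += 1
    return count, total

# Usage: a file object iterates line by line without loading the whole file.
# with open("huge.dat") as f:
#     n, s = aggregate(f)
```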

I would think about it as a fold operation
(http://en.wikipedia.org/wiki/Fold_%28higher-order_function%29)
over the lines in the file, where the parser state and any
aggregate calculated values are stored in the accumulator, and each new line
is considered in conjunction with the current state and the previous
calculated values.
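As a sketch of that fold in Python (the "section:" header convention is invented for illustration): the accumulator carries the current parser state plus the aggregates, and each step combines it with one line.

```python
from functools import reduce

def step(acc, line):
    """One fold step: (current_section, per_section_totals) + line -> new acc."""
    state, totals = acc
    if line.startswith("section:"):           # hypothetical section header
        state = line.split(":", 1)[1].strip()
        totals.setdefault(state, 0)
    elif state is not None:
        totals[state] += 1                    # e.g. count lines per section
    return (state, totals)

def fold_file(lines):
    # reduce() folds step over the lines; works on a lazy file iterator too.
    return reduce(step, lines, (None, {}))[1]
```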

------
nshah
Depending on language restrictions, you may be able to implement read streams
that make a single pass through the file and invoke appropriate callbacks when
hitting each section...
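One way that callback dispatch could look in Python (the "## " section-marker convention and handler names are assumptions, not anything from the original app): a single pass routes each line to whatever handler is registered for the current section.

```python
def stream_sections(lines, callbacks, marker="## "):
    """Single pass over the lines; dispatch each line to the callback
    registered for the section it falls under. Marker format is made up."""
    handler = None
    for line in lines:
        if line.startswith(marker):
            name = line[len(marker):].strip()
            handler = callbacks.get(name)     # None = skip unregistered sections
        elif handler is not None:
            handler(line)

# Usage: register one handler per section, then stream the file through.
# with open("huge.dat") as f:
#     stream_sections(f, {"head": parse_header, "body": parse_row})
```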

------
wmf
First, get the file into Hadoop. From there the parsing and processing should
be easier.

