

Processing large files, line by line - rayvega
http://fastml.com/processing-large-files-line-by-line/

======
wting
This is an excessively long blog post that basically states: do stream
processing when your data set doesn't fit into memory.

    
    
        with open('in.txt') as input, open('out.txt') as out:
            for line in input.readlines():
                out.write(foo(line))
    

Python users are used to reading everything into memory all at once, while in
C everything is done in small chunks whenever possible.

Python 3 is also moving in this direction by replacing many of the built-in
functions with their iterator equivalents (map, range, etc.).
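
For instance, in Python 3 map returns a lazy iterator instead of a list, and
range no longer materializes all of its values at once (roughly):

    
    
        >>> map(int, ['1', '2', '3'])
        <map object at 0x...>    # lazy; wrap in list() to force evaluation
        >>> range(10**12)        # constant memory, not a trillion-element list
        range(0, 1000000000000)
    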

You might think that this means forcing everything into one big context
manager, but that's not necessarily true. For example:

    
    
        import csv
        from itertools import imap
    
        def read_file(filename):
            with open(filename, 'r') as f:
                reader = csv.reader(f)
                for line in reader:
                    yield line
    
        def write_file(filename, data):
            with open(filename, 'w') as f:
                writer = csv.writer(f)
                map(writer.writerow, data)
    
        write_file(
            filename='out.txt',
            data=imap(foo, read_file('in.txt')))

~~~
d0mine
Don't use `for line in file.readlines():`; do `for line in file:` instead.

Don't use `map(writer.writerow, data)`; do `writer.writerows(data)` instead.
Open csv files in binary mode in Python 2 and pass newline='' in Python 3.

You also forgot to open the output file in the first snippet in write mode.
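
Put together, a corrected sketch of the second snippet (Python 2, to match
the imap import; foo is still the placeholder per-row transform from the
earlier snippets) might look like:

    
    
        import csv
        from itertools import imap
    
        def read_file(filename):
            # binary mode, as the Python 2 csv module expects
            with open(filename, 'rb') as f:
                for row in csv.reader(f):
                    yield row
    
        def write_file(filename, data):
            # 'wb' makes the write mode explicit, and binary as well
            with open(filename, 'wb') as f:
                csv.writer(f).writerows(data)
    
        write_file(
            filename='out.txt',
            data=imap(foo, read_file('in.txt')))
    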

------
csense
There was an article a few weeks ago on the front page of HN about an
interview question for data scientists that was essentially the "exact-split"
problem mentioned at the end of the blog post. The article (or maybe it was
the comment thread) showed an algorithm to randomly split a file of size m+n
into disjoint sublists of size m and n, using a single pass through the data
and O(n) memory.

This blog post's algorithm accomplishes the same task with two passes and
O(m+n) space. It seems odd that an article explicitly about encouraging
readers to think like the authors of UNIX and make simple reusable utilities
that process stream data would use a two-pass algorithm when a fairly simple
one-pass algorithm is available.

~~~
ZygmuntZ
You need to know m and n, or the number of lines, beforehand for a one-pass
version, don't you?

~~~
csense
Thinking about it some more, I think the previously mentioned problem had only
m unknown; n was known ahead of time.
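
For the record, a minimal one-pass sketch under that assumption (n known up
front, m and the total length unknown) is plain reservoir sampling; the file
names here are made up:

    
    
        import random
    
        def split_file(in_path, rest_path, n):
            # Keep a uniform random reservoir of n lines; every line that
            # gets evicted (or never selected) streams straight to rest_path.
            reservoir = []
            with open(in_path) as src, open(rest_path, 'w') as rest:
                for i, line in enumerate(src):
                    if i < n:
                        reservoir.append(line)
                        continue
                    j = random.randrange(i + 1)
                    if j < n:
                        rest.write(reservoir[j])
                        reservoir[j] = line
                    else:
                        rest.write(line)
            return reservoir  # the n sampled lines, held in O(n) memory
    

One pass, O(n) memory, and every line ends up in exactly one of the two
outputs.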

------
walshemj
Give me strength. Big data is NOT when it's too large to fit in primary memory.

~~~
walshemj
I see the hobby programmers are out in force today.

