
Don’t Slurp: How to Read Files in Python - mssaxm
http://axialcorps.com/2013/09/27/dont-slurp-how-to-read-files-in-python/
======
j_baker
I like that you can slurp a file in two lines of Python. And for someone just
learning Python, the author's solution is just unnecessarily complicated. How
many people learning Python are going to have a need to optimize file reading
to this level?

And besides that, the author's solution is only good for one situation: when
you want to read something line-by-line, which isn't always the case. For
binary files, you may want to do something like this (untested code):

    
    
        data = infile.read(256)
        while data:
            do_something(data)
            data = infile.read(256)
    

Also, it seems like the author hasn't heard of enumerate:
[http://docs.python.org/2/library/functions.html#enumerate](http://docs.python.org/2/library/functions.html#enumerate)
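
A related idiom for the chunked read above is the two-argument form of iter(),
which keeps calling the read until it returns the sentinel (a sketch assuming
binary-mode reads on Python 2; the filename is hypothetical):

    
    
        # iter(callable, sentinel) calls infile.read(256) repeatedly and
        # stops as soon as it returns '' (end of file).
        with open('data.bin', 'rb') as infile:  # 'data.bin' is a made-up name
            for chunk in iter(lambda: infile.read(256), ''):
                do_something(chunk)
    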

~~~
paulgb
I agree that it's good to have an easy way to read a whole file in, but when I
think about it, I can't think of any case where I've had to write code to read
a file that I _didn't_ want to process line-by-line, in which case the non-
slurping method is actually _less_ code.

~~~
j_baker
It sounds like you've only read in line-by-line files then. I mean, it doesn't
make sense to read in a JSON or XML document line-by-line. Nor does it make
sense to read in most binary files line-by-line.

Some formats, such as CSV, do make sense to read in line-by-line, though.

~~~
paulgb
I've read in lots of formats, including many XML and binary files. 99% of the
time there is a library that already handles the low-level interaction with
those files. The only times I've had to write original Python code that reads
files directly have typically been things that could be processed line by line
(e.g. various record-based data formats). I'm trying to think of a
counterexample but I'm coming up blank.

------
gwu78
Assuming this is true (filters are faster than slurping), how do we explain the
perception of Perl and the many similar interpreters that followed as being
"faster than the shell" (i.e. UNIX standard utilities and pipes)? Since the
dawn of Perl, and through the Python era, one could conclude that the shell
and UNIX utilities have been all but abandoned for doing work with large
files, in favor of using other interpreters and their myriad helper libraries.

My guess is that hardware improved and made slurping easier to do: available
RAM increased, which allowed slurping to displace filters as the preferred way
to work with files.

In a resource constrained environment, I still prefer filters to slurping. But
how many developers or users today perceive their environment as resource
constrained?

~~~
Terretta
Except that Perl defaults to filter mode. In Perl

    
    
        while (<FILE>) { print $_; }
    

is line by line and efficient. Perl programmers avoid

    
    
        my @entire_file=<$yourhandle>;
    

But even that loads the array line by line.

As a side note, good Perl has always done things while there are things to do,
in a list-processing kind of approach that feels more functional than imperative.

// Slurping the whole file into a single thing is actually a chore:
[http://stackoverflow.com/questions/206661/what-is-the-best-w...](http://stackoverflow.com/questions/206661/what-is-the-best-way-to-slurp-a-file-into-a-string-in-perl)

~~~
mct
If you're writing a filter, an even better pattern is:

    
    
            while (<>) {
                    ...
            }
    

By omitting the file handle, "<>" will go line by line through each file
specified on the command line. If no filename arguments were passed, it reads
from STDIN.

------
scott_s
Others have already pointed out that looping over the lines in the file (which
uses iterators) is the more obvious way to do it. But, there's an even better
way. Check out "Generator Tricks for System Programmers":
[http://www.dabeaz.com/generators/](http://www.dabeaz.com/generators/)

It has been submitted to HN many times:
[https://www.hnsearch.com/search#request/all&q=generator+tric...](https://www.hnsearch.com/search#request/all&q=generator+tricks+for+system+programmers)

------
jdnier
It would also be more idiomatic to write

    
    
        import sys
        for i, line in enumerate(sys.stdin):
            print '{:>6} {}'.format(i, line[:-1])

~~~
riskable
Unless you want your line numbers to start at 0, do this instead:

    
    
        import sys
        for lineno, line in enumerate(sys.stdin, 1):
            print('{:>6} {}'.format(lineno, line[:-1]))
    

The second argument to enumerate() is the 'start' (it can also be passed as a
keyword). So by passing 1, we start numbering there instead of at 0.

------
cpjk
Forgive me, but doesn't file.readline() provide the same functionality for
reading a single line at a time from a file?

~~~
chaosphere2112
Yeah, but aren't we actually supposed to do this?

    
    
      for line in file:

~~~
scott_s
Yes, and that is the idiomatic way to read files in Python. I guess the author
and I are looking at different Python programs, because slurping the whole
file is not "by far the most common way" I have seen files read in the wild.

~~~
rprospero
Most of my code slurps the entire file at a time. I can't actually think of
any code we have that streams the data.

Then again, most of my files are three-dimensional matrices. There's nothing I
can really do on a line-by-line basis.

------
mistercow
>It also happens to nearly always be the wrong way to read a file

It's the wrong way in many cases, but it's the right way in a very large
number of cases, if not the majority.

Often you need to read in a relatively small file, then do something trivial
with it, or toss it through a couple of library-provided string processing
functions. Or maybe the files are a bit bigger, and you're writing a script to
automate some grunt work. You expect to run this script once. Or, more
generally, maybe you're writing some code that gets called once a month and
takes less than a second to complete.

In any of those situations, it would be silly to do anything other than
slurping. String manipulation is easy to reason about. Stream processing is
not.

Also note that most OSes cache files in memory, so if you are reading the same
file often, the slowdown from reading the data into memory is drastically
reduced.

~~~
rsobers
Yup. There are literally millions of instances of code that slurps and Just
Works and the users don't care and the investors don't care and the servers
don't care and the programmer just moved on with life and knocked out the next
feature. And nobody cared and it never mattered that it was "wrong."

~~~
pyre
It could be used as a known vector to crash the program: swap the file that it
reads for something large enough to fill up RAM, unless the code does some size
checks before slurping.

~~~
mistercow
An attacker with that access can also DoS a program that doesn't slurp using
the same technique. Also, on a 64-bit system, a DoS is all you can
realistically get from this vector anyway. I don't know what happens if you
actually fill Python's available address space (it seems pretty difficult),
but I'd be shocked if it were a crash, and not an exception.

------
kevingadd
Generally good advice, but note that in some cases 'slurping' is actually much
better.

The most obvious one is where you can exploit parallelism (either CPU
parallelism or storage parallelism) by fetching multiple entire files at once
and preparing them in memory. This allows you to start spending CPU time
processing one loaded file, while other files load in the background. When you
stream a file one line at a time, other than some basic optimistic lookahead,
it's not really possible for the OS to do as much to help you there, so you're
going to be effectively single-threaded. If the computation you're doing on
the data is significant, you can end up unable to even keep a single core busy
with computation.
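
A rough sketch of that pattern, assuming Python 3's concurrent.futures (or the
futures backport on Python 2); the filenames and process() are hypothetical
stand-ins:

    
    
        import concurrent.futures

        def slurp(path):
            # Runs in a worker thread; the GIL is released during the read.
            with open(path, 'rb') as f:
                return f.read()

        def process(data):
            return len(data)  # stand-in for the significant per-file computation

        paths = ['a.dat', 'b.dat', 'c.dat']  # hypothetical input files
        with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
            # The main thread processes each file as it arrives while the pool
            # keeps loading the remaining files in the background.
            for data in pool.map(slurp, paths):
                process(data)
    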

------
bcoates
If you're writing a filter, you probably want to use fileinput, which does
most of the stuff you'd want in a "read lines from a bunch of files and do
something" text-processing program.

[http://docs.python.org/2/library/fileinput.html](http://docs.python.org/2/library/fileinput.html)
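
A minimal sketch of such a filter with fileinput (the line-number formatting
just mirrors the article's example):

    
    
        import fileinput

        # fileinput.input() yields lines from each file named on the command
        # line, falling back to stdin if no filenames were given -- the Python
        # analog of Perl's "while (<>)".
        for line in fileinput.input():
            print('{:>6} {}'.format(fileinput.filelineno(), line.rstrip('\n')))
    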

------
riskable
His example is OK but I was able to improve it significantly with a few minor
changes:

    
    
        "A simple filter that prepends line numbers" # <-- Docstring
        import sys
        for fname in sys.argv[1:]: # ./program.py file1.txt file2.txt ...
            with open(fname) as f:
                # This reads in one line at a time from the file
                for lineno, line in enumerate(f, 1): # Start at 1
                    print '{:>6} {}'.format(lineno, line[:-1])
    

My way lets you pass as many files as you want on the command line, has a
proper docstring, and uses the enumerate() function (so you don't need the
silly `lineno = 0` and `lineno += 1` lines).

~~~
nobodysfool
nah, way better to use a generator...

    
    
        with open("a.txt") as f:
            c = ["{0} : {1}".format(x,y) for x,y in enumerate(f,1) ]
        for x in c:
            print x,

~~~
prutschman
c is assigned a list, not a generator. Switching the square brackets for
parentheses creates a generator, but attempting to access the elements will
fail because f was closed after exiting the 'with' scope.

    
    
      <ipython-input-8-e2c5ebe72b17> in <module>()
      ----> 1 for x in c:
            2     print x,
            3 
      
      <ipython-input-7-9460e3a04a4e> in <genexpr>(***failed resolving arguments***)
            1 with open("/tmp/foo.txt") as f:
      ----> 2     c = ("{0} : {1}".format(x,y) for x,y in enumerate(f,1))
            3 
      
      ValueError: I/O operation on closed file
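
A sketch of a fix is to keep the consumption inside the with block, so the file
is still open while the generator is drained:

    
    
        with open("a.txt") as f:
            numbered = ("{0} : {1}".format(x, y) for x, y in enumerate(f, 1))
            for line in numbered:
                print line,
    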

------
gargh
Doesn't the io module
([http://docs.python.org/2/library/io.html](http://docs.python.org/2/library/io.html))
do this without resorting to non-typical Python code?

~~~
richardjs
Is it non-idiomatic to do "for [line] in [file object]"? I use "for line in
open('file')" all the time, for similar reasons as presented in the article.
"for line in sys.stdin" is basically the same pattern, just with a different
file object.

Edit: The idiom's mentioned in the Python docs on IO [1] as "memory efficient,
fast, and leads to simple code"

[1]
[http://docs.python.org/2/tutorial/inputoutput.html#methods-o...](http://docs.python.org/2/tutorial/inputoutput.html#methods-of-file-objects)

~~~
icebraining
Nitpick: it's better to do "with open('file') as file: for line in file: ..."
instead, but otherwise yes, iterating over file objects is great.

Another option is using mmap[1], particularly when the file is already in
memory or you need more random access to it. It worked well when I was trying
to parse some lines from the end of an open log file.

[1]
[http://docs.python.org/2/library/mmap.html](http://docs.python.org/2/library/mmap.html)
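
For instance, a minimal sketch of grabbing the last line of a log file via mmap
(Python 2-style strings; the path is hypothetical):

    
    
        import mmap

        with open('/var/log/app.log', 'r') as f:  # hypothetical log file
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            try:
                # Search backwards for the start of the last line.
                start = mm.rfind('\n', 0, len(mm) - 1) + 1
                last_line = mm[start:]
            finally:
                mm.close()
    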

~~~
richardjs
Is there a difference between the two? I assumed that without the "with", the
file would still be closed once the loop was exited (and thus the reference to
the file is dropped), but I'm open to the possibility that I'm mistaken.

~~~
tjgq
Without the context manager (with statement), the underlying file is closed
when the file object is garbage-collected.

In CPython, since reference counting is used for GC, this occurs when the loop
exits. However, other implementations (e.g. PyPy) may use different schemes
that do not guarantee collection as soon as objects go out of scope. As an
extreme, a valid and occasionally useful GC strategy is to never collect
anything at all [0]!

Hence, if you want to portably ensure the file is closed, you should either
use the context manager or call close() explicitly.

[0]
[http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047...](http://blogs.msdn.com/b/oldnewthing/archive/2010/08/09/10047586.aspx)
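
A sketch of the explicit-close spelling of the same loop (process() here is a
hypothetical per-line function):

    
    
        f = open('file')
        try:
            for line in f:
                process(line)
        finally:
            f.close()  # runs even if process() raises, like the with statement
    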

------
dfc
Was anyone else unfamiliar with `/usr/bin/jot`? It looks like it is an obscure
way of doing:

    
    
        $ seq 1 10000000
    

I think a lot of people forget about the beauty and power of coreutils.

~~~
mssaxm
The example was generated on a Mac, which uses the FreeBSD userland utilities.
seq is not included in non-GNU userland utilities (as it is not POSIX); jot
is the (more-or-less) equivalent of seq for BSD systems.

~~~
286c8cb04bda

      $ uname -s
      Darwin
      $ type seq
      seq is /usr/bin/seq
    

The man page says --

The seq command first appeared in Plan 9 from Bell Labs. A seq command
appeared in NetBSD 3.0, and was ported to FreeBSD 9.0. This command was based on
the command of the same name in Plan 9 from Bell Labs and the GNU core
utilities. The GNU seq command first appeared in the 1.13 shell utilities
release.

------
webhat

        So the moral of the story is that Python makes it simple and elegant to write stream-processors on line-buffered data-streams.
    

I thought every language had this or a similar method as a best practice when
processing 'large' files.

------
jlujan
Another nitpick: the article doesn't specifically say text file. Who is to say
it has newlines at any reasonable spacing? Why is this on the front page?

------
samspenc
Very interesting. This is how Hadoop streaming handles file I/O as well.

------
keypusher
This is why you don't ask programming questions to LinkedIn.

------
wfunction
How in the world is this "faster" as stated?

~~~
pyre
If you're processing something one line at a time, and outputting something
based on each line, then you don't need to read-and-process the entire file
before printing everything out.

Other than that, it's more memory efficient.

~~~
wfunction
That's not "faster" (i.e. it doesn't take less time to run), it just has less
delay from when it send back the first piece of output.

~~~
pyre
As part of a larger system, removing the delay can cause other parts of the
system to do their processing in parallel. While this doesn't reduce the
amount of time that it takes that piece of the pipeline to run, it will reduce
the total runtime of the system.

It can also be faster if you're searching for something in the file, because
you can short-circuit reading the rest of the file when you find it.
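
A minimal sketch of that kind of short-circuit search (the path and needle are
hypothetical):

    
    
        def find_first(path, needle):
            # Returns (lineno, line) for the first match; the rest of the
            # file is never read.
            with open(path) as f:
                for lineno, line in enumerate(f, 1):
                    if needle in line:
                        return lineno, line
            return None
    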

