
Ask HN: What is the most efficient way to process large text files in Python? - chirau
I have fairly large text files (at least 5GB) and I am doing analysis on them. All I want to do is tally totals in one column and then sort, using the STANDARD LIBRARY ONLY. However, after researching on Google and StackOverflow, it seems there is no consensus on the best approach.

I have tried OrderedDict, the csv module, itertools, and buffers, but it still seems I am not utilizing all the resources available to me to make it even faster. I/O is ridiculously slow.

I have 32GB RAM and an octa-core setup. Should it really be taking me more than 10 min to process a single file? How best do I make sure I take full advantage of the memory and power of my machine?
======
eesmith
Why in the world do you want to use the standard library only? This is what
Pandas was designed for, with optimized C code.

What are you doing which is different from something like:

    
    
      totals = []
      with open(filename, "rb") as f:  # binary mode skips decoding overhead
        for line in f:
          fields = line.split()
          totals.append(fields[2])  # or whatever column you need
      totals.sort()
      ... do something ...
    

or using collections.Counter()

    
    
      import collections

      with open(filename, "rb") as f:
        c = collections.Counter(line.split()[2] for line in f)
      print(c.most_common(10))
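
And if "tally totals" means summing a numeric column per key rather than
counting rows, a plain dict handles that in one pass (the column indices
here are made up; adjust to your data):

      totals = {}
      with open(filename) as f:  # text mode for simplicity
        for line in f:
          fields = line.split()
          key, value = fields[0], float(fields[2])  # hypothetical columns
          totals[key] = totals.get(key, 0.0) + value
      for key, total in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        print(key, total)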

~~~
enz
Yes, Pandas is what you need. And its CSV processor is really fast compared to
the standard CSV module.
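
A minimal sketch of that route (the file name, separator, and column name
are all assumptions; adjust them to your data):

      import pandas as pd

      # read_csv's default C engine is much faster than the pure-Python csv module
      df = pd.read_csv("data.txt", sep="\t", usecols=["key"])
      counts = df["key"].value_counts()  # tally per key, sorted descending
      print(counts.head(10))

For a 5GB file, read_csv's chunksize parameter lets you stream and
aggregate chunk by chunk instead of loading everything at once.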

------
bufferoverflow
If you want speed, use multithreaded C, C++, Rust, or Go. Python is one of
the slowest languages out there, according to various benchmarks. Cython
might help you some.

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/fasta.html)

[https://benchmarksgame-team.pages.debian.net/benchmarksgame/...](https://benchmarksgame-team.pages.debian.net/benchmarksgame/performance/spectralnorm.html)

(as you can see, Python is 50-100 times slower than C++)

------
detaro
stdlib only excludes a bunch of the common answers (this does sound like a
case where pandas etc. would be useful, shifting work to native code). That
said, 10 min for 5 GB sounds bad: that's only ~8 MB/s.

Generally, it's hard to answer in the abstract what the problem might be.
Some starting points: you might be unnecessarily creating temporary objects,
e.g. large lists. If you don't actually need CSV parsing, the csv module
might be slower than something simpler. A profiler might help you there. You
can modify your program for multiple cores, but IMHO that's a later step,
once you've got the single-core baseline fast.
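
As a rough stdlib-only sketch of that later multi-core step (the file name
and column index are made up): split the file into byte ranges, let each
worker tally its range, and merge the Counters.

      import multiprocessing
      import os
      from collections import Counter

      def count_chunk(args):
          filename, start, end = args
          counts = Counter()
          with open(filename, "rb") as f:
              if start:
                  f.seek(start - 1)
                  f.readline()  # finish the line straddling the chunk boundary
              while f.tell() < end:
                  line = f.readline()
                  if not line:
                      break
                  fields = line.split()
                  if len(fields) > 2:
                      counts[fields[2]] += 1  # hypothetical column
          return counts

      if __name__ == "__main__":
          filename = "data.txt"  # hypothetical input file
          size = os.path.getsize(filename)
          n = os.cpu_count() or 1
          bounds = [size * i // n for i in range(n + 1)]
          jobs = [(filename, bounds[i], bounds[i + 1]) for i in range(n)]
          with multiprocessing.Pool(n) as pool:
              total = sum(pool.map(count_chunk, jobs), Counter())
          print(total.most_common(10))

Profiling the single-process version first (python -m cProfile yourscript.py)
will tell you whether parsing or I/O is actually the bottleneck.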

