
I just ran a simple benchmark with PostgreSQL 10 and pandas: 67 million rows of integers, 4 columns wide.

pg 10

  huge=# \timing on
  Timing is on.
  huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
  COPY 67108864
  Time: 85858.899 ms (01:25.859)
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 132784.743 ms (02:12.785)       
  huge=# vacuum analyze lotsarows;      
  VACUUM                                
  Time: 185040.485 ms (03:05.040)       
  huge=# select count(*) from lotsarows;
    count                               
  ----------                            
   67108864                             
  (1 row)                               
                                        
  Time: 48622.062 ms (00:48.622)        
  huge=# select count(*) from lotsarows where a > b and c < d;
    count                       
  ----------                    
   16783490                     
  (1 row)                       
                                
  Time: 48569.866 ms (00:48.570)

pandas

  In [2]: import pandas as pd                                           
                                                                        
  In [3]: %time df = pd.read_csv('data.csv')                            
  CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s                    
  Wall time: 38.7 s                                                     
                                                                        
  In [4]: %time len(df)                                                 
  CPU times: user 125 µs, sys: 19 µs, total: 144 µs                     
  Wall time: 166 µs                                                     
  Out[4]: 67108864                                                      
                                                                        
  In [5]: %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()       
  CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s                    
  Wall time: 1.88 s                                                     
  Out[5]: 16783490
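
For anyone who wants to try the pandas side without the original data.csv, here is a minimal sketch. The column names a, b, c, d come from the queries above; everything else (the helper names `make_frame` and `count_matches`, the random value range, the seed, the smaller row count) is an assumption for illustration, not the setup used for the numbers above. The match count above is about 25% of the rows, which would be consistent with independent random values, but the actual distribution of the data is not stated.

```python
import time

import numpy as np
import pandas as pd


def make_frame(n_rows: int, seed: int = 0) -> pd.DataFrame:
    """Build n_rows of random integers in four columns a, b, c, d.

    The value range and seed are arbitrary (assumed, not from the benchmark).
    """
    rng = np.random.default_rng(seed)
    data = rng.integers(0, 1_000_000, size=(n_rows, 4), dtype=np.int64)
    return pd.DataFrame(data, columns=["a", "b", "c", "d"])


def count_matches(df: pd.DataFrame) -> int:
    """Count rows where a > b and c < d, the same predicate as the SQL query."""
    mask = (df["a"] > df["b"]) & (df["c"] < df["d"])
    return int(mask.sum())


if __name__ == "__main__":
    # Far fewer rows than the 67,108,864 above, so it fits on a small machine.
    df = make_frame(1_000_000)
    start = time.perf_counter()
    matches = count_matches(df)
    elapsed = time.perf_counter() - start
    print(f"{matches} of {len(df)} rows matched in {elapsed:.4f} s")
```

Bump the argument to `make_frame` toward 2**26 if you have the memory; four int64 columns at that size take roughly 2 GB.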


