Is the database server running on a $700 workstation? How many rows? What types of queries? What is a typical query execution time? I'm interested in your response because I generally find RDBMS performance quite poor, although I've never used SQL Server. Pandas gets to be fairly painful after the data size hits 10GB, in my experience. I do think you're missing how pandas fits into a data exploration pipeline for someone who knows how to code - there are plenty of tasks for which SQL+BI is not going to cut it.
Commodity virtualized server with 4 cores and 8GB of RAM; storage is on a NAS. We have hundreds of these typical SQL instances.
The db has a lot of rows: around 20k rows are logged per minute, and events have been logged 24/7 for three years.
Again, because the schema is well designed, I use a clustered index on Date to filter and analyze, so the engine never has to read the whole db. It reads only the pages I need - that's the benefit of the millions of man-hours MSFT has invested in optimizing its SQL engine.
Typical response time depends on the date period; I try to write queries that don't take more than 5-10 secs to run.
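To make that concrete, here's a minimal sketch of the pattern, assuming a hypothetical event-log table - the table and column names are invented, not the actual schema. Because the clustered index keeps rows physically ordered by the timestamp, a date-range predicate turns into a seek over just the pages in that window:

-- Hypothetical table; names are illustrative only.
CREATE TABLE dbo.EventLog (
    EventTime  datetime2     NOT NULL,
    Source     int           NOT NULL,
    EventType  smallint      NOT NULL,
    Payload    nvarchar(400) NULL
);

-- The clustered index stores the rows in EventTime order.
CREATE CLUSTERED INDEX CIX_EventLog_EventTime ON dbo.EventLog (EventTime);

-- A date-bounded query only touches the pages covering that window.
SELECT COUNT(*)
FROM   dbo.EventLog
WHERE  EventTime >= '2024-01-01' AND EventTime < '2024-01-02';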
If you have RDBMS performance problems, just hire an expensive DBA for a brief period and learn, learn, learn from her how to work with large data effectively. A good DBA can optimize your db to run as fast as your hardware's I/O speed allows.
I just did a simple benchmark: 67 million rows, integers, 4 columns wide, with postgresql 10 and pandas.
pg 10
huge=# \timing on
Timing is on.
huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
COPY 67108864
Time: 85858.899 ms (01:25.859)
huge=# select count(*) from lotsarows;
count
----------
67108864
(1 row)
Time: 132784.743 ms (02:12.785)
huge=# vacuum analyze lotsarows;
VACUUM
Time: 185040.485 ms (03:05.040)
huge=# select count(*) from lotsarows;
count
----------
67108864
(1 row)
Time: 48622.062 ms (00:48.622)
huge=# select count(*) from lotsarows where a > b and c < d;
count
----------
16783490
(1 row)
Time: 48569.866 ms (00:48.570)
pandas
In [2]: import pandas as pd
In [3]: %time df = pd.read_csv('data.csv')
CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s
Wall time: 38.7 s
In [4]: %time len(df)
CPU times: user 125 µs, sys: 19 µs, total: 144 µs
Wall time: 166 µs
Out[4]: 67108864
In [5]: %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()
CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s
Wall time: 1.88 s
Out[5]: 16783490
OK, so 20k * 60 minutes * 24 hours * 365 days * 3 years = 31,536,000,000 rows. You are querying 31.5 billion rows on a machine with 4 cores and 8GB of RAM? Are the queries that return in 5-10 seconds running over the entire table, or over small portions of it?
Small portions of it - sometimes an hour or two, sometimes a day or a week.
Most of the time there are 5-6 other conditions as well.
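For illustration, a query of that shape might look like the sketch below, reusing the hypothetical EventLog table from above; the extra predicates are made up. The clustered index narrows the read to one day's pages, and the remaining conditions are applied only to those rows:

-- Illustrative only: a one-day window plus several additional filters.
SELECT   Source, EventType, COUNT(*) AS Events
FROM     dbo.EventLog
WHERE    EventTime >= '2024-03-04' AND EventTime < '2024-03-05'
  AND    EventType IN (3, 7, 11)
  AND    Source <> 0
  AND    Payload IS NOT NULL
GROUP BY Source, EventType
ORDER BY Events DESC;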
Pandas has to full-scan the entire dataframe for any filter, while SQL can use an index seek.
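You can sketch that difference on the Postgres benchmark table above - this isn't part of the original benchmark, the index name is invented, and the plan comments show what I'd expect rather than captured output:

-- The benchmarked predicate compares columns to each other, so no b-tree helps;
-- the planner has to sequentially scan every page:
EXPLAIN SELECT count(*) FROM lotsarows WHERE a > b AND c < d;
-- expected: Seq Scan on lotsarows

-- A selective range predicate on an indexed column can instead seek into the
-- index and read only the matching slice:
CREATE INDEX lotsarows_a_idx ON lotsarows (a);
EXPLAIN SELECT count(*) FROM lotsarows WHERE a BETWEEN 100 AND 200;
-- expected: Index Only Scan (or Bitmap Index Scan) using lotsarows_a_idx

-- Pandas, by contrast, evaluates the boolean mask over all 67M rows every time,
-- as in the In [5] timing above.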