Dataframes don't really have a clear boundary around what they can or can't do. I see no reason why you couldn't shoehorn some control-flow logic into one, at which point it's pretty much just an executor for an arbitrary computation graph specified in a DSL, behaving however the developer decided to implement it.
I'd rather take SQL because I have a better understanding of what it is doing.
Easy in Pandas.
Dataframes are not tables; tables are not dataframes. It's nearly as apples-to-bananas as comparing Python lists to C arrays.
pandas is just a poor man's SQL+BI. pandas stores everything in memory and has many limitations.
in SQL Server I can easily churn through a terabyte-sized database and get the data I need, because the schema is well designed with partitioned tables and clustered indexes; a well-written SQL query takes less than a second to run against a >1TB database. It even allows a team of more than 20 people to work against the same server and query it simultaneously.
I would love to see how you'd analyze a 1TB csv file, or a pile of 100 csv files totalling over 10TB, where you have yet to discover the schema and how the tables join together. and I am doing it on a simple $700 workstation, not even using any hadoop nonsense
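For illustration, a minimal sketch of the kind of schema being described, assuming SQL Server; the Events table, its columns, and the monthly boundaries are all made up:

-- Hypothetical monthly range partitioning on the date column
CREATE PARTITION FUNCTION pf_monthly (date)
    AS RANGE RIGHT FOR VALUES ('2021-01-01', '2021-02-01', '2021-03-01');

CREATE PARTITION SCHEME ps_monthly
    AS PARTITION pf_monthly ALL TO ([PRIMARY]);

CREATE TABLE Events (
    EventDate date         NOT NULL,
    EventType int          NOT NULL,
    Payload   varchar(200) NOT NULL
) ON ps_monthly (EventDate);

-- A clustered index aligned with the partition scheme keeps each
-- date range physically contiguous on disk
CREATE CLUSTERED INDEX cix_Events_EventDate
    ON Events (EventDate) ON ps_monthly (EventDate);

With something like that in place, a query filtered on EventDate only ever touches the partitions and pages in range.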
I don't know... I honestly feel like 'BI solutions' are a poor person's Python if you are doing anything more than simple dashboards. Something that can be done in 2 lines of code in a notebook requires endless fiddling in an IDE, to produce something that isn't easily reproducible.
As an aside, I've no experience with Tableau or Power BI; I just know Crystal Reports and SSRS, which are pretty painful.
cherry-pick what you need, ETL your data out of legacy systems into your warehouse, and run something like Tableau/Looker/Power BI on top, and you will be amazed how effective you can be.
Then, once you have 'insight' into your data, you can easily 'do' something with it without the limitations of a tiered product.
the db has a lot of rows: around 20k rows logged per minute, and the events have been logged 24/7 for three years.
Again, because the schema is well designed, I use the clustered index on Date to filter and analyze, and the engine never reads the whole db. It reads only the pages I need, and that's the benefit of the millions of man-hours MSFT has invested in optimizing its SQL engine.
typical response time depends on the date period; I try to write queries that don't take more than 5-10 secs to run.
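To make that concrete, a hedged example of the access pattern (table and column names are hypothetical):

-- With a clustered index on EventDate, only the leaf pages for this
-- one week are read; the rest of the three-year table is untouched.
SELECT EventType, COUNT(*) AS events
FROM   Events
WHERE  EventDate >= '2021-03-01'
  AND  EventDate <  '2021-03-08'
GROUP BY EventType;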
if you have RDBMS performance problems, just hire an expensive DBA for a brief period and learn, learn, learn from her how to work with large data effectively. DBAs can optimize your db to run as fast as your hardware's I/O allows.
huge=# \timing on
Timing is on.
huge=# copy lotsarows from '~/src/lotsarows/data.csv' with csv header;
Time: 85858.899 ms (01:25.859)
huge=# select count(*) from lotsarows;
Time: 132784.743 ms (02:12.785)
huge=# vacuum analyze lotsarows;
Time: 185040.485 ms (03:05.040)
huge=# select count(*) from lotsarows;
Time: 48622.062 ms (00:48.622)
huge=# select count(*) from lotsarows where a > b and c < d;
Time: 48569.866 ms (00:48.570)
In : import pandas as pd
In : %time df = pd.read_csv('data.csv')
CPU times: user 34.1 s, sys: 4.49 s, total: 38.6 s
Wall time: 38.7 s
In : %time len(df)
CPU times: user 125 µs, sys: 19 µs, total: 144 µs
Wall time: 166 µs
In : %time ((df['a'] > df['b']) & (df['c'] < df['d'])).sum()
CPU times: user 1.74 s, sys: 135 ms, total: 1.88 s
Wall time: 1.88 s
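For what it's worth, the Postgres side of that comparison is a completely untuned table. A hedged aside: a partial index matching the predicate would let the last count be answered with an index-only scan instead of a full table scan (names as in the session above):

-- Index only the rows satisfying the predicate; count(*) over the
-- same predicate can then be served from the index alone, provided
-- a VACUUM has kept the visibility map current.
CREATE INDEX lotsarows_filtered ON lotsarows (a) WHERE a > b AND c < d;
SELECT count(*) FROM lotsarows WHERE a > b AND c < d;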
Less pedantically: you can quite easily and elegantly drop NaNs, cast str -> float, perform just about any operation over a rolling window, and apply trig functions on top. Generally in a much saner way than pandas.
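A rough sketch of what that looks like in SQL (Postgres flavor; the readings table and its columns are invented for illustration):

-- Keep only numeric strings, cast them to float, then compute a
-- 30-row rolling average and a trig function per row
SELECT ts,
       SIN(val) AS sin_val,
       AVG(val) OVER (ORDER BY ts
                      ROWS BETWEEN 29 PRECEDING AND CURRENT ROW) AS rolling_avg
FROM (
    SELECT ts, CAST(raw_val AS double precision) AS val
    FROM   readings
    WHERE  raw_val ~ '^-?[0-9]+(\.[0-9]+)?$'  -- drops NULLs and non-numeric text
) AS clean;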
In my life as an analyst and data scientist I've found SQL to be far, far superior to pandas for something like 90-95% of tasks.
Yes you can; Postgres supports Python UDFs, btw. Though I don't think that's necessarily a feature you'd use every day.
SQL is pretty powerful if you look deep
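For the curious, a minimal sketch of the Python UDF point above, assuming the plpython3u extension is available (the function itself is a toy, and creating it needs superuser rights since the language is untrusted):

CREATE EXTENSION IF NOT EXISTS plpython3u;

-- The function body between the $$ markers is ordinary Python
CREATE FUNCTION py_slugify(s text) RETURNS text AS $$
    import re
    # lowercase, collapse runs of non-alphanumerics into hyphens
    return re.sub(r'[^a-z0-9]+', '-', s.lower()).strip('-')
$$ LANGUAGE plpython3u;

SELECT py_slugify('Hello, World!');  -- hello-world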