Hacker News
Loading Data into Pandas: Tips and Tricks You May or May Not Know (dataground.io)
49 points by spacejunkjim 3 days ago | 12 comments

Not to detract from the article, but the most important pandas parameters for me are `iterator=True` and `chunksize=x`, for streamed processing. Here's an example of processing a CSV file with 400 million latitude and longitude coordinates.[1]

[1]: https://ad.vgiscience.org/twitter-global-preview/00_Twitter_...
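A minimal sketch of the chunked-reading idea (the tiny in-memory CSV and column names are invented for illustration; a real use would pass a file path):

```python
import io
import pandas as pd

# Stand-in for a huge CSV of coordinates
csv = io.StringIO("lat,lng\n51.0,13.7\n48.1,11.6\n40.7,-74.0\n")

total = 0
# chunksize makes read_csv yield DataFrames of at most 2 rows each,
# so you never hold the whole file in memory at once
for chunk in pd.read_csv(csv, chunksize=2):
    total += len(chunk)  # do your per-chunk processing here

print(total)  # 3
```

Passing `chunksize` alone already returns an iterator; `iterator=True` is mainly useful when you want to call `.get_chunk()` with varying sizes.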

This is a great mention, ty!

I find pd.read_sql pretty useful for integration with SQLite too.


I find pd.read_sql and SQLAlchemy very handy sometimes.
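A quick sketch of pd.read_sql against an in-memory SQLite database (the table name and rows here are invented; a SQLAlchemy engine would work in place of the raw sqlite3 connection):

```python
import sqlite3
import pandas as pd

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE users (id INTEGER, name TEXT)")
con.executemany("INSERT INTO users VALUES (?, ?)", [(1, "ada"), (2, "alan")])

# read_sql runs the query and returns the result as a DataFrame
df = pd.read_sql("SELECT * FROM users ORDER BY id", con)
print(len(df))  # 2
```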

I've mostly replaced pd.read_csv and pd.read_parquet with duckdb.query("select * from 'x.csv'") or duckdb.query("select * from 'y/*.parquet'").

It’s much faster because DuckDB is vectorized. The result is a Pandas dataframe.

Querying the Pandas dataframe from DuckDB is faster than querying it with Pandas itself.

Some good tips in here, I find myself reaching for the JSON/Excel methods often.

Despite using it for years, I still haven't decided whether pandas is poorly architected or whether the clunkiness (for lack of a better term) is a result of the inherent difficulty of the tasks.

I seem to remember the author writing a while back about how it was their first major project and there were a ton of things they'd learned and would like to change.

Another tip: use https://github.com/sfu-db/connector-x to load database query results into pandas without a memory copy, which makes loading faster.

I wish I had known about json_normalize sooner. JSON is great, but transforming it into a CSV/Excel sheet is a chore. Great to know about this one-liner!!
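For anyone who hasn't seen it: a tiny sketch of pd.json_normalize flattening nested records into dotted columns (the records here are made up):

```python
import pandas as pd

records = [
    {"id": 1, "user": {"name": "ada", "city": "london"}},
    {"id": 2, "user": {"name": "alan", "city": "cambridge"}},
]

# Nested dicts become "parent.child" columns in one call
df = pd.json_normalize(records)
print(list(df.columns))  # ['id', 'user.name', 'user.city']
```

From there, `df.to_csv(...)` or `df.to_excel(...)` finishes the job.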

I spent some time working on a Python library called DataProfiler.


The gist is that you can point to any common dataset and load it directly into pandas.

from dataprofiler import Data

data = Data("your_file.csv") # Auto-Detect & Load: CSV, AVRO, Parquet, JSON, Text, URL

I simply hate dealing with loading data, so it's my go-to.

My tip is to keep a dict of all the fields and the data types you expect them to be, particularly strings. In my company we have IDs that start with zeros, or are a mix of numbers and letters, and they get interpreted as numeric types. I'm frequently pulling data out of the DW with the same fields, so I just use the dtype= arg, point it at my dict, and it takes care of that for me.
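A sketch of that pattern, with a hypothetical schema dict and made-up column names: forcing ID-like columns to str keeps leading zeros from being lost.

```python
import io
import pandas as pd

# Kept alongside the recurring DW queries; names are illustrative
SCHEMA = {"account_id": str, "amount": float}

csv = io.StringIO("account_id,amount\n00042,9.5\n007A1,3.0\n")

# Without dtype=, "00042" would be read as the integer 42
df = pd.read_csv(csv, dtype=SCHEMA)
print(df["account_id"].tolist())  # ['00042', '007A1']
```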

In my experience, the best way to load data into pandas is https://www.atlassian.com/software/bamboo
