Hacker News
Ask HN: Data people, what things do you always look for in a new dataset?
4 points by jameskerr on Aug 7, 2023 | 2 comments



Presuming the data is tabular and has rows and columns...

I look for a manifest or README file that usually explains what the columns are.

I look for columns that could be used as unique identifiers or could be primary/foreign keys in a db table.
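A rough sketch of that check, assuming the data is a CSV read with Python's standard csv module. The sample rows and the candidate_key_columns helper are made up for illustration. A column that is unique in a sample is only a candidate key, so it still needs confirming against the full data and the README.

```python
import csv
import io

def candidate_key_columns(rows):
    """Return column names whose values are all non-blank and unique."""
    rows = list(rows)
    if not rows:
        return []
    candidates = []
    for name in rows[0].keys():
        values = [row[name] for row in rows]
        has_blank = any(v is None or v.strip() == "" for v in values)
        if not has_blank and len(set(values)) == len(values):
            candidates.append(name)
    return candidates

# Hypothetical sample data, invented for this example.
SAMPLE_CSV = """order_id,customer_id,city
1001,7,Oslo
1002,7,Bergen
1003,9,Oslo
"""

rows = csv.DictReader(io.StringIO(SAMPLE_CSV))
print(candidate_key_columns(rows))  # ['order_id']
```

Here customer_id repeats, which is what you would expect from a foreign key pointing at a customers table.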

I look at the names of all the columns to understand the domain, and if I don't know what a column represents, I make a note of it to find out more.

I look for the data type used for each column.
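One way to sketch the type check, assuming every value arrives as a string (as it does from a CSV). The infer_type helper is hypothetical and only distinguishes int, float, and str. Dates, booleans, and similar types would need their own rules.

```python
def infer_type(values):
    """Guess the narrowest of int, float, or str that fits every non-blank value."""
    non_blank = [v for v in values if v.strip() != ""]
    if not non_blank:
        return "empty"
    for name, cast in (("int", int), ("float", float)):
        try:
            for v in non_blank:
                cast(v)  # raises ValueError if the value does not parse
            return name
        except ValueError:
            continue
    return "str"

# Hypothetical columns, invented for this example.
print(infer_type(["1", "2", ""]))      # int
print(infer_type(["1", "2.5"]))        # float
print(infer_type(["north", "south"]))  # str
```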

For each numerical column, I look at the range of values and some basic stats: min/max/mean/mode/std. dev.
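Those stats can be sketched with Python's standard statistics module. The summarize helper and the sample values are made up for illustration, and statistics.stdev is the sample standard deviation, so it needs at least two values.

```python
import statistics

def summarize(values):
    """Return basic descriptive stats for a list of numbers."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "mode": statistics.mode(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
    }

# Hypothetical column values, invented for this example.
stats = summarize([2, 4, 4, 6])
for name, value in stats.items():
    print(name, value)
```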

If the data is in a domain I know, I make a note of whether each column's numerical values make sense (does a temperature of -9000 degrees make sense, or is it a sensor malfunction / no-read value?).
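A minimal sketch of that sanity check. The bounds of -90 to 60 are an assumption for illustration (roughly plausible outdoor air temperatures in Celsius), not something from a real spec. The right limits depend on the sensor and the domain.

```python
def flag_out_of_range(values, low, high):
    """Return (index, value) pairs for readings outside [low, high]."""
    return [(i, v) for i, v in enumerate(values) if not (low <= v <= high)]

# Hypothetical readings, invented for this example.
readings = [21.5, 19.0, -9000.0, 22.1]
print(flag_out_of_range(readings, low=-90.0, high=60.0))  # [(2, -9000.0)]
```

Anything flagged still needs a human decision: it may be a sentinel for "no reading" that should become a blank, or a genuine outlier.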

I look for incomplete rows, and if anything is blank, I ask why.
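Finding the blanks can be sketched like this, again assuming a CSV read with csv.DictReader. The blank_cells helper and the sample rows are hypothetical. DictReader fills fields missing from a short row with None, so both empty strings and truncated rows are caught. Working out why they are blank is still manual.

```python
import csv
import io

def blank_cells(rows):
    """Return {row_number: [names of blank columns]} for incomplete rows."""
    incomplete = {}
    for row_number, row in enumerate(rows, start=1):
        missing = [
            name
            for name, value in row.items()
            # A None key holds surplus fields from an over-long row; skip it.
            if name is not None and (value is None or value.strip() == "")
        ]
        if missing:
            incomplete[row_number] = missing
    return incomplete

# Hypothetical sample data, invented for this example.
SAMPLE_CSV = """id,name,temp
1,alpha,20.5
2,,19.0
3,gamma
"""

print(blank_cells(csv.DictReader(io.StringIO(SAMPLE_CSV))))
# {2: ['name'], 3: ['temp']}
```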

I suppose if you understand all of those, you should be ready to load the data into a db or use it for further analytics.

Practically, you want to understand the magnitude of the data: how many columns and rows does an average payload or batch contain?

Can the data fit in memory or not?

Does the data come in chunks or is it streamed somehow?
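If you aren't sure the data fits in memory, one option is to measure it in chunks rather than loading it all at once. This is a sketch using only the standard library. The iter_chunks and measure helpers, the chunk size, and the sample data are assumptions for illustration.

```python
import csv
import io
from itertools import islice

def iter_chunks(rows, chunk_size):
    """Yield lists of at most chunk_size rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, chunk_size))
        if not chunk:
            return
        yield chunk

def measure(rows, chunk_size):
    """Count rows and find the widest row, holding one chunk in memory at a time."""
    total_rows = 0
    max_columns = 0
    for chunk in iter_chunks(rows, chunk_size):
        total_rows += len(chunk)
        max_columns = max(max_columns, max(len(row) for row in chunk))
    return total_rows, max_columns

# Hypothetical sample data, invented for this example.
SAMPLE_CSV = "id,value\n1,10\n2,20\n3,30\n4,40\n5,50\n"

reader = csv.reader(io.StringIO(SAMPLE_CSV))
header = next(reader)  # consume the header so it is not counted as a data row
print(header, measure(reader, chunk_size=2))  # ['id', 'value'] (5, 2)
```

For a real file you would pass an open file handle to csv.reader instead of the io.StringIO wrapper.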


Thank you. This is helpful.



