
There are a few steps to consider before you load data into RAM.

Can you partition the data in any useful way? For example, if queries use separate date ranges, you can partition the data so that each query only touches the relevant range. Can you precompute anything? Sometimes tricky things done within the context of multiple joins can be done once and written to a table for later use. Can you materialize any views? Do you have the proper indexes set up for your joins and filters? Are you looking at the execution plans for your queries? Sometimes small changes can speed up queries by many orders of magnitude.
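
To make the index and execution-plan point concrete, here is a minimal sketch. It uses Python's built-in sqlite3 module rather than Postgres so that it runs without a server; the `events` table, its columns, and the index name are made up for illustration. In Postgres the equivalent check would be `EXPLAIN` or `EXPLAIN ANALYZE`, and the plan output looks different.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return the 'detail' column of SQLite's EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Hypothetical table: 100 rows per day for 28 days.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, day TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events (day, amount) VALUES (?, ?)",
    [(f"2020-01-{d:02d}", float(d)) for d in range(1, 29) for _ in range(100)],
)

# A filter on a date range, the kind of query the comment describes.
sql = "SELECT SUM(amount) FROM events WHERE day BETWEEN ? AND ?"
args = ("2020-01-10", "2020-01-12")

# Without an index on the filter column, the planner has to scan the table.
plan_before = query_plan(conn, sql, args)
total_before = conn.execute(sql, args).fetchone()[0]

# Add an index on the filter column and look at the plan again.
conn.execute("CREATE INDEX idx_events_day ON events (day)")
plan_after = query_plan(conn, sql, args)
total_after = conn.execute(sql, args).fetchone()[0]

print("before:", plan_before)
print("after: ", plan_after)
```

The result of the query is the same either way; only the plan changes, from a scan of the whole table to a search through the index. On a table this small the difference in run time is negligible, so treat this as a way to read plans, not as a benchmark.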

Smart queries + properly structured data + a well-tuned Postgres DB is an incredibly powerful tool.

Can I set up efficient indexes on Parquet data to use with Spark, or is it necessary to use a DB?
