Hacker News new | past | comments | ask | show | jobs | submit login

I think it might go deeper than that. I wear a lot of hats - part software engineer, part data engineer, part data scientist - and I find that the only perspective from which RDBMSes feel like the clear front-runner is when I'm wearing the software engineer hat.

When I'm wearing the data engineer hat, I start to get interested in some of the options that aren't even databases at all, such as storing the data in ORC files. Because they sometimes offer big performance advantages. The data intake pipelines I'm working on are batch-oriented, so they benefit little from how RDBMSes are optimized for random access to records, and maybe also ranges of records, but only at the cost of write performance if you're looking to filter on a natural key instead of a synthetic key. . . it gets thorny. ORC is less-than-stellar at selective ad-hoc queries, but I'm not doing selective ad-hoc queries against the internals of my data intake pipeline, and I will fight to the death to prevent the BI team from doing so.

When I'm wearing my data scientist hat, I've got no idea what my next query on the data is, so there's no indexing strategy that could possibly meet my needs. That kills one of the RDBMS's biggest value propositions. So let's just wing it with map-reduce and get on with life. And, TBH, SQL is just too declarative for the way my brain works when I'm in analyst mode - the semi-imperative workflow I get out of something like Spark SQL or R or Pandas feels more natural.






Have you considered snowflake or redshift? I've gone very far ahead with just simple SQL using these tools for a variety of DE and DS projects. Once even managed to get an ML model running on redshift.

If that is your jam, check out BigQuery' ML in SQL



Guidelines | FAQ | Support | API | Security | Lists | Bookmarklet | Legal | Apply to YC | Contact

Search: