Hacker News
Databases for data scientists (josiahparry.com)
20 points by sebg 20 days ago | 18 comments



I don't know many data scientists who aren't already comfortable with SQL and database concepts. Aren't databases the primary data source they have access to at enterprise organizations?


Data scientists originally came from science, not database administration and business analytics, hence the science in the name.

Databases weren't big in science because they used to be row-oriented, and most questions in science are column-oriented. Most workloads in science are of the map, filter, reduce variety. If you used a regular RDBMS for those, it would have extremely poor performance and scalability.
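
To make the row versus column distinction concrete, here is a minimal sketch in plain Python. The records and field names (`id`, `temp`, `site`) are made up for illustration; real column stores add compression and vectorized execution on top of this basic layout idea.

```python
# Hypothetical measurements, stored two ways.

# Row-oriented: one record per observation, all fields kept together.
rows = [
    {"id": 1, "temp": 20.5, "site": "a"},
    {"id": 2, "temp": 22.0, "site": "b"},
    {"id": 3, "temp": 19.5, "site": "a"},
]

# Column-oriented: one list per field, values of a field kept together.
columns = {
    "id": [1, 2, 3],
    "temp": [20.5, 22.0, 19.5],
    "site": ["a", "b", "a"],
}

def mean_temp_rows(rows):
    # Has to walk every full record even though only one field is needed.
    return sum(r["temp"] for r in rows) / len(rows)

def mean_temp_columns(columns):
    # Reads only the one column the question is about.
    temps = columns["temp"]
    return sum(temps) / len(temps)

print(mean_temp_rows(rows), mean_temp_columns(columns))
```

Both functions return the same answer; the difference is how much unrelated data each layout forces you to touch for a single-column aggregate.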

Computers have caught up with most business-sized datasets. Today you can fit a record for every human being alive in a single Postgres database without too much trouble and even run queries against it. This was not the case in 2012 when the term was invented.


Yes, at least as far as I know. Lots of SQL, some Power BI.


I agree with the point in the post, to be honest. I'm no data scientist, but I am a programmer, and from my point of view it's always good to use fewer resources. Considering how much data data scientists have to work with, I can imagine that memory limitations can be quite frustrating.

Nonetheless, every person working with data and programming should know database concepts.


Getting inputs and publishing results very often means working with databases, and that choice is often not up to the data scientist. What you do in between is another story, but at least some db skills are essential…


Need one... for what? You should be using a database to keep track of experiment results, model lineage etc. even if it's just a sqlite file.
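
For what it's worth, a minimal sketch of that idea using Python's built-in `sqlite3` module might look like the following. The `runs` table, its columns, and the helper names are all hypothetical, not taken from the post; `":memory:"` is used so the example runs anywhere, and a file path would give you the "just a sqlite file" version.

```python
import sqlite3

def open_tracker(path=":memory:"):
    """Open (or create) a tiny experiment log. Schema is a made-up example."""
    conn = sqlite3.connect(path)
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS runs (
            id        INTEGER PRIMARY KEY,
            model     TEXT NOT NULL,
            parent_id INTEGER REFERENCES runs(id),  -- lineage: run this one was derived from
            params    TEXT,                         -- free-form hyperparameter description
            metric    REAL
        )
        """
    )
    return conn

def log_run(conn, model, metric, params="", parent_id=None):
    """Insert one run and return its id."""
    cur = conn.execute(
        "INSERT INTO runs (model, parent_id, params, metric) VALUES (?, ?, ?, ?)",
        (model, parent_id, params, metric),
    )
    conn.commit()
    return cur.lastrowid

def best_run(conn):
    """Return (id, model, metric) of the run with the highest metric."""
    return conn.execute(
        "SELECT id, model, metric FROM runs ORDER BY metric DESC LIMIT 1"
    ).fetchone()

conn = open_tracker()
base = log_run(conn, "baseline", 0.81, "lr=0.1")
tuned = log_run(conn, "tuned", 0.86, "lr=0.01", parent_id=base)
print(best_run(conn))
```

The `parent_id` column is one simple way to record lineage; whether that is enough depends on how your models are actually derived from one another.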


> You should be using a database to keep track of experiment results, model lineage etc. even if it's just a sqlite file.

You should be keeping track of those things, but I see little benefit and lots of downsides in using an SQL database to do so.


What are the downsides? Do you work in a team?


Hard to make it visible to others. Can't use standard VCS/diff/etc., so you don't get history unless you store it manually. Limited and cumbersome types, e.g. you can't use sets in a natural way (I think sqlite may now have some sort of support for JSON columns but they're clunky).


Fair enough if it’s not your tool of choice, but I think you may have missed a lot of what SQLite can do these days.

https://antonz.org/sqlite-is-not-a-toy-database/


I don't think any of that contradicts anything I said?


You can import and export CSV, JSON, etc. (including Parquet via an extension), which gets you VCS diffs for history. Set operations are built in (though I admit they may not be as “natural” as you like), and there’s been decent (though, again, your idea of cumbersome may be different than mine) JSON blob support for some time.

It’s a tool that consistently gets underestimated at the same time it’s also consistently improving.
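
A rough sketch of the two points above, using only Python's standard library: SQL set operations (`INTERSECT`, `EXCEPT`) and a sorted CSV dump that can be committed to version control and diffed. The tables `run_a` and `run_b` and the helper `dump_csv` are hypothetical names for illustration.

```python
import csv
import io
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE run_a (feature TEXT);
    CREATE TABLE run_b (feature TEXT);
    INSERT INTO run_a VALUES ('age'), ('income'), ('zip');
    INSERT INTO run_b VALUES ('income'), ('zip'), ('tenure');
    """
)

def dump_csv(conn, query):
    """Render a query result as CSV text with a header row."""
    cur = conn.execute(query)
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow([col[0] for col in cur.description])
    writer.writerows(cur)
    return buf.getvalue()

# Set operations: features used by both runs, and features only in run_a.
shared = [r[0] for r in conn.execute(
    "SELECT feature FROM run_a INTERSECT SELECT feature FROM run_b ORDER BY feature"
)]
only_a = [r[0] for r in conn.execute(
    "SELECT feature FROM run_a EXCEPT SELECT feature FROM run_b ORDER BY feature"
)]

# A sorted dump keeps the text stable between exports, so diffs stay small.
snapshot = dump_csv(conn, "SELECT feature FROM run_a ORDER BY feature")
print(shared, only_a)
print(snapshot)
```

Writing `snapshot` to a file in the repository on each change is one way to get history out of ordinary VCS tooling; it is still a manual step, which is the trade-off raised earlier in the thread.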


Database connection? No sir not me, please email me the xlsx from 2008 :)


LOL … based on true stories, and the same person said "We need big data and use blockchain in our company".


I work a lot with F500 companies which have been around for decades, and that hurts because it's so true.

I hate XLSX with a passion. The main issue is that as a file format it's too heavy, slow, and clunky, and it does weird data type conversions, but that's not the only problem.

The main problem with Excel is "mission creep". It just does too many things in a half-assed way. So when you get a bunch of Excel files to process for your data analysis task, you can't be sure someone didn't add a bunch of charts, merge cells, use formulas, apply strange currency or unit formatting gimmicks, structure the data in a weird way to make it visually appealing to management, and so on. It's all so tiresome.


Can I dump my parquets in Azure data lake and query them with DuckDB?



Yeah, duck can query almost anything: https://www.definite.app/blog/query-any-ducking-thing



