I love love love Python for data science, in part because it's dynamically typed. I can bang things out quickly without worrying about the engineering bits, and, since I'm working in an interactive coding environment, it's generally easy enough to just inspect the values of my variables to figure out what they are.
I hate hate hate Python for ML engineering, in part because it's dynamically typed. The same features that make it so easy to hack out a quick data analysis make it absolutely awful to build for durability. For example, since stuff in production runs hands-off, you need to be pretty confident about the return types of every function to be confident you won't throw a type error at run time. Actually pinning this down can get quite complicated, though, when you're working with a library like scikit-learn that relies heavily on duck typing. Sometimes you end up going down a rabbit hole just to clearly identify and document all the types your code might accept or return.
(Disclaimer: Hate aside, it's still my preferred ML engineering language. You've got to take the bad with the good, and the language gets you access to an ecosystem that is so very good.)
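For what it's worth, the rabbit hole usually ends with writing the types down yourself at the boundary. A rough sketch of what that can look like (the wrapper function and the choice of estimator here are hypothetical, not from any particular codebase):

    # Hypothetical boundary function: commit to concrete types in and out,
    # even though predict_proba itself only promises to accept "array-like".
    from __future__ import annotations

    import numpy as np
    import numpy.typing as npt
    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    def predict_proba_positive(
        model: LogisticRegression, features: pd.DataFrame
    ) -> npt.NDArray[np.float64]:
        """Return P(class == 1) per row as a plain float64 array."""
        proba = model.predict_proba(features.to_numpy())
        return np.asarray(proba[:, 1], dtype=np.float64)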
This is absolutely it. Untyped languages are great for glue-scripts, for exploration (of an API, a dataset, whatever), for quick-and-dirty things. As soon as your logic grows beyond "what can be appropriately expressed in <5 files" and/or "this is going to have a second developer", types become helpful.
I see it completely the other way around, coming from the Python, dynamically typed side. For me, statically typed languages have a benefit on the smaller side of the scale, but absolutely bomb when the code base grows. At that point everything is a FooFactory or an IInterface with no help from the IDE anyway, because of IoC/DI/attribute reflection magic. And when it's that big, everyone argues over folder, package, and inheritance hierarchies and the "right way" to refactor and reuse code, with the inevitable slide into yet another level of inheritance or interfaces. All the while peppered with Singletons, overloads, and new abstract virtual base methods with complicated method override rules.
Obviously I exaggerate a bit, but we've all seen various incarnations of a lot of those issues.
The second you’ve “engineered” yourself into losing good IDE support, half the benefit of using a strongly typed language goes out the window, in my opinion. Though there's maybe some bias here, because I make an IDE! :)
Happily, with TS it’s possible to have DI and IInstantiationService and all that and still maintain good IDE support, in no small part because the IDE itself is built with all of those, in TS... if it were unusable, we’d fix it.
IMO dataframes are the reason why dynamic typing fits data science so well. It's certainly possible to represent a single dataframe as a static type, but representing all the slicing, column removal, joins, etc. is actually pretty hard without dependent-type tricks. So bypassing types for dataframes is preferable. On your ML engineering point, the other side of it is that once your dataframe's schema is finalized, it really should be statically typed so that assumptions can safely be made about what is and isn't inside of it.
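To make that second half concrete, here's a minimal sketch of "freezing" a finalized schema so downstream code can rely on it. The column names and dtypes are invented for the example; libraries like pandera formalize the same idea.

    # Hypothetical finalized schema: fail loudly at the boundary instead of
    # deep inside the pipeline.
    import pandas as pd

    EXPECTED_SCHEMA = {
        "user_id": "int64",
        "signup_date": "datetime64[ns]",
        "ltv": "float64",
    }

    def enforce_schema(df: pd.DataFrame) -> pd.DataFrame:
        missing = set(EXPECTED_SCHEMA) - set(df.columns)
        if missing:
            raise ValueError(f"missing columns: {sorted(missing)}")
        wrong = {
            col: str(df[col].dtype)
            for col in EXPECTED_SCHEMA
            if str(df[col].dtype) != EXPECTED_SCHEMA[col]
        }
        if wrong:
            raise TypeError(f"unexpected dtypes: {wrong}")
        # Return only the agreed-upon columns, in a fixed order.
        return df[list(EXPECTED_SCHEMA)]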
> representing all the slicing, column removal, joins, etc. is actually pretty hard without dependent-type tricks
Disagree in the strongest possible terms, tbh.
It's the lack of static typing that gets you 3/4 of the way down your experimental pipeline only for your code to fail because column "trianing_batch" can't be found. Huge productivity loss, even with rapid iteration.
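A tiny sketch of that failure mode (the step names are stand-ins): nothing complains about the misspelling until execution actually reaches the bad lookup.

    import pandas as pd

    def expensive_feature_engineering(df: pd.DataFrame) -> pd.DataFrame:
        # imagine several minutes of joins and aggregations here
        return df.assign(training_batch=df["x"] % 10)

    def train(df: pd.DataFrame) -> None:
        # Raises KeyError: 'trianing_batch' -- but only after the expensive
        # step above has already run to completion. An upfront column check
        # (or a static schema) would surface the typo immediately instead.
        batches = df["trianing_batch"]
        print(batches.value_counts())

    train(expensive_feature_engineering(pd.DataFrame({"x": range(100)})))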
We must work very differently. I couldn't fathom that happening to me, if only because I compulsively peek at samples of the data frame every step of the way, in order to make sure the data look reasonable all the way through.