Excellent library for train_test_split.
Jokes aside, this library, next to NumPy, Pandas, Jupyter, Matplotlib, and the DL libraries, is the reason Python is the powerhouse it is for data science.
I'm with you on sklearn, the DL libraries and Numpy, but Pandas and Matplotlib are poor, poor relations of the tools available in the R ecosystem (dplyr/ggplot etc).
I used to very strongly agree with you re: matplotlib, but I've recently switched from using almost exclusively ggplot2 to almost exclusively Matplotlib and my realization is that they are very different tools serving very different purposes.
ggplot2 is obviously fantastic and makes beautiful plots, and very easily at that. However it is definitely a "convention over configuration" tool. For 99% of the typical plots you might want to create, ggplot is going to be easier and look nicer.
However, matplotlib really shines when you want to make very custom plots. If you have a plot in your mind that you want to see on paper, matplotlib will be the better tool for helping you create exactly what you are looking for.
For certain projects I've done, where I want to do a bunch of non-standard visualizations, especially ones that tend to be fairly dense, I prefer matplotlib. For day to day analytics ggplot2 is so much better it's ridiculous. The real issue is that Python doesn't really offer anything in the same league as ggplot2 for "convention over configuration" type plotting.
Fully agree on Pandas. R's native data frame + tidyverse is worlds easier. Pandas' overly complex indexing system is a persistent source of annoyance no matter how much I use that library.
> Fully agree on Pandas. R's native data frame + tidyverse is worlds easier. Pandas' overly complex indexing system is a persistent source of annoyance no matter how much I use that library.
Is it just the syntax/readability that annoys you, or are there actually problems that need like n steps more to do the same with Pandas?
I spend more time working around pandas' strange -isms than it takes me to write vanilla Python that does the same thing. The index problems are not just small annoyances; they can sometimes waste hours because of awkward defaults. For example, df.to_csv writes the index by default (without a column name!). It doesn't make any sense to me whatsoever that reading a CSV and then writing it back out would add a new column. I'm really tired of rerunning pandas code after I forget to turn that stupid default index setting off. Is that a small thing? Sure. But it has tons of small things like that.
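For anyone who hasn't been bitten by this yet, here is a minimal sketch of the round trip being described. The column names and values are made up for illustration; the only pandas calls used are read_csv and to_csv.

```python
import io

import pandas as pd

# Made-up data, purely for illustration.
csv_text = "city,temp\nOslo,3\nLima,19\n"
df = pd.read_csv(io.StringIO(csv_text))

# Default behaviour: the row index is written as an extra, unnamed first column.
with_index = df.to_csv()

# Reading that output back produces a column the original file never had.
reread = pd.read_csv(io.StringIO(with_index))

# index=False writes only the real columns, so the round trip is clean.
without_index = df.to_csv(index=False)
```

After this runs, reread has three columns (the first one is what pandas calls "Unnamed: 0"), while without_index has the same header and rows as the input.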
Wait how many companies are actually using R in the wild? As I understand it, R is born of academia, great for statistics/analysis but breaks down on data manipulation and isn't used in production/data engineering. Maybe my understanding is dated though?
Of the many companies I've done data science with I can only think of a few, rare exceptions where R wasn't used as much as if not more than Python.
If you're mostly dealing with Neural Nets you won't see much R, but for anything really statistical in nature R is a much better tool than Python. For anything that ends up in a report R is much better than Python (a lot of very valuable data science work ends up being a report to non-technical people).
> breaks down on data manipulation
This is very outdated. The tidyverse eco-system has bumped R back into being first in class for data manipulation now. This becomes less true as you get further and further from having your data in a matrix/df (I can't imagine doing Spark queries in R), but if you already have a basic data frame, manipulation from there is very easy.
Even for things that end up in production, whether you're in R or Python, whatever your first pass is should always be a prototype and will have to be reworked before you get close to moving it to production.
> Wait how many companies are actually using R in the wild?
Depends on your definition. While it's not very often 'deployed' in 'production', I know lots of places in all kinds of industries where people reach for R as soon as they have to look at some new data.
R is everywhere, especially when you need to visualize stuff. It is primarily used in teams who are trying to get rid of SAS in my experience.
You are right in the sense that R is typically not used end-to-end as far as I can tell; it usually starts from a data connection to some sort of dump, data lake, or data warehouse.
Many people in my team use Python for modelling, but grab ggplot in whatever way to make their presentations and visuals (they all use different methods, usually something messy like mixing Python and R in a notebook or so). ggplot also has a vast library of super high quality plugins.
- You can brush over the date range to filter the bar chart
- You can click on weather type to filter the scatter chart
- It can be embedded in any webpage with these interactive elements intact. Since the chart is represented by JSON and rendered by JavaScript, the spec also embeds the data within the chart itself, and therefore allows the user to change the chart however they want
There are Python ports of ggplot (e.g. plotnine (https://github.com/has2k1/plotnine)), but agreed, Python is behind here. I'm not the best at data viz, but I can usually piece together a way to make ggplot do what I want it to do without that much trouble or looking at documentation.
Matplotlib, though ... that's a harder beast to internalize. I know it's possible to make high-quality matplotlib plots, but it's much harder for me. Like pandas, it's a library that I don't want to denigrate because I know people put lots of effort into it, but I can't lie -- I'm not a fan.
It's mostly the "in production" part that determines whether R is suitable for a business or not. It's much more complicated to avoid runtime errors or do proper testing in R, whereas it shines for interactive use, or generating reports.
That said, having used both, the DSLs for plotting and data wrangling in the R package ecosystem are vastly superior to pandas and the Python plotting libraries. For modeling I actually like the better namespacing of Python, which helps keep things more legible when there are a ton of model options to choose from, assuming you don't need cutting edge statistics.
> It's much more complicated to avoid runtime errors or do proper testing in R
It's not that much harder. There's no pytest, but testthat works well enough. I've developed a few packages internally in R and wouldn't say it was that much harder to ensure correctness than for the corresponding Python packages. (We used to keep them in sync, before basically moving everything to Python.)
I actually quite like R's error handling. It's as good as Common Lisp's which is often held up as the epitome of this.
You also have the dump.frames option, which will save your workspace on failure, which is incredibly useful when running R stuff remotely/in a distributed fashion.
> Wait how many companies are actually using R in the wild? As I understand it, R is born of academia, great for statistics/analysis but breaks down on data manipulation and isn't used in production/data engineering.
It depends, I've worked in some places where R was the core part of their data infrastructure. Data manipulation (of non text) is far, far better in R.
Integrating with other systems can be tricky though, and you don't have the wide variety of Python libraries available for core SE tasks, so it can often make sense to use Python even though it's not as good for a lot of the core work.
Additionally, R is a very, very flexible language (like Python), but without strong community-led norms (unlike Python), so it's pretty easy to make a mess with it.
Finally, when you need to hand over stuff to software engineers, they vastly tend to prefer Python, so it often ends up being used to make this stuff easier.
Like, in R there's a core tool called broom which will pull out the important features of a model and make it really easy to examine them with your data. There's nothing comparable in Python, and I miss it so so much when I use Python.
That being said, working with strings is much much nicer in Python, and pytest is the bomb, so there's tradeoffs everywhere.
> Additionally, R is a very, very flexible language (like Python)
I'd argue that R is much more flexible than Python syntactically. There's a reason that every attempt at recreating dplyr in Python ends in a bit of a mess (IMO) -- Python just doesn't allow the sort of metaprogramming you'd require for a really nice port. Something as simple as a general pipe operator can't be defined in Python, to say nothing of how dplyr scopes column names within verbs.
Arguably this does allow you to go crazy in a way that ends up being detrimental to readability, but I'd say overall it's a net benefit to R over Python. I really miss this stuff and have spent an undue amount of time thinking of the best way to emulate it (only to come up with ideas that just disappoint).
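To make the pipe point concrete, here is roughly the closest thing I know of with plain operator overloading. It is only a sketch: Pipe, keep, and total are hypothetical names invented for this example, not part of any library. It also shows where the emulation falls short of a real pipe: every right-hand side has to be wrapped, and a left operand that already implements `|` gets first say over what the operator means.

```python
class Pipe:
    """Wrap a function so that `value | Pipe(f, *args)` calls f(value, *args)."""

    def __init__(self, func, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __ror__(self, value):
        # Called for `value | self` when value's own `|` does not handle Pipe.
        return self.func(value, *self.args, **self.kwargs)


def keep(rows, predicate):
    """Hypothetical verb: keep the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


def total(rows, key):
    """Hypothetical verb: sum one field across rows."""
    return sum(row[key] for row in rows)


rows = [{"x": 1}, {"x": 5}, {"x": 10}]
result = rows | Pipe(keep, lambda row: row["x"] > 2) | Pipe(total, "x")
```

Note the lambda and the string "x": nothing here lets you write a bare column name, which is the part dplyr gets from non-standard evaluation.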
> Finally, when you need to hand over stuff to software engineers, they vastly tend to prefer Python
Indeed, this is maybe 50% of the reason my organization has pushed R to the sidelines over the past few years. We used to be very heavily into R but now it has "you can use it, but don't expect support" status.
> Well that's just lazy evaluation of function arguments, which can't be done in Python.
"Just lazy evaluation"! :) It's a pretty big deal. This is three-fifths of the way to a macro system.
> But if you take a look at the Python data model, it does seem super, super flexible.
Sure, you can have a lot of control over the behavior of Python objects (some techniques of which remain obscure to me even after using Python for many years). But you don't have anything like syntactic macros. You can define a pipe operator with macropy, though -- it's pretty easy. But macropy is basically dead now I think (and a total hack).
> You'll still need strings for column names in any dplyr port though, because of the function argument issue.
This is major, though, because you can't do this:
mutate(df, x="y" + "z")
You have to do something like what dfply does, defining an object that defines addition, subtraction, etc.
mutate(df, x=X.y + X.z)
But that hits corner cases quickly. What if you want to call a regular Python function that expects numeric arguments? This won't work:
mutate(df, x=f(X.y))
etc. Granted, this only really works in R because it's easy to define functions that accept and return vectors. So in that sense it's kind of a leaky abstraction. But you couldn't even get that far in Python, because X.y isn't a vector ... it's a kind of promise to substitute a vector.
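Here is a stripped-down sketch of the "promise" idea, to show both why X.y + X.z can be made to work and why f(X.y) cannot. This is not dfply's actual implementation; Deferred, ColumnFactory, and this mutate are hypothetical stand-ins, and the "data frame" is just a dict of lists.

```python
import math


class Deferred:
    """A promise to compute a column once a table is supplied."""

    def __init__(self, resolve):
        self.resolve = resolve  # callable: table -> list of values

    def __add__(self, other):
        # Only Deferred + Deferred is handled in this sketch.
        return Deferred(
            lambda table: [
                a + b for a, b in zip(self.resolve(table), other.resolve(table))
            ]
        )


class ColumnFactory:
    """X.y builds a Deferred that looks up column "y" later."""

    def __getattr__(self, name):
        return Deferred(lambda table: table[name])


X = ColumnFactory()


def mutate(table, **new_columns):
    """Return a copy of the table (a dict of lists) with extra columns."""
    out = dict(table)
    for name, expression in new_columns.items():
        out[name] = expression.resolve(table)
    return out


table = {"y": [1, 2], "z": [10, 20]}
result = mutate(table, x=X.y + X.z)  # works: "+" is overloaded on Deferred

failure = None
try:
    # math.sqrt runs immediately and sees a Deferred, not numbers.
    mutate(table, x=math.sqrt(X.y))
except TypeError as exc:
    failure = exc
```

The addition works because Deferred gets to intercept it. The call to math.sqrt fails before mutate is even entered, because Python evaluates the argument eagerly.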
Give Python macros, I say! To hell with the consequences!
> Sure, you can have a lot of control over the behavior of Python objects (some techniques of which remain obscure to me even after using Python for many years). But you don't have anything like syntactic macros.
Nice, I'd love for this to see the light of day. I suspect it'll see some resistance (even pattern matching caused conflict, and I thought that was terribly innocuous).
(Why can I reply at this level of nesting now, whereas before I couldn't?)
> I'd argue that R is much more flexible than Python syntactically. There's a reason that every attempt at recreating dplyr in Python ends in a bit of a mess (IMO) -- Python just doesn't allow the sort of metaprogramming you'd require for a really nice port. Something as simple as a general pipe operator can't be defined in Python, to say nothing of how dplyr scopes column names within verbs.
Well that's just lazy evaluation of function arguments, which can't be done in Python. But if you take a look at the Python data model, it does seem super, super flexible. You'll still need strings for column names in any dplyr port though, because of the function argument issue.
Like, both Python/R derive from the CLOS approach (Art of the Metaobject Protocol), but R retains a lot more of the lispy goodness (but Python's implementation is easier to use).
Hehe, I used to do R. IMO you are right about ggplot, but I strongly disagree about pandas. I fing love it. Would love to understand your troubles with it though; after using it daily for 4 years, maybe I can offer some perspective ;)
Pandas' indexing system is overly complex and I've never personally benefited from that. To start with there are __getitem__, loc and iloc approaches to accessing values. If your library constantly has to warn users that "you might be doing something wrong, read the docs!" that should be a warning sign that you don't have the correct level of abstraction. R has a much more sane API and assumptions about when you want to access a value by reference (which is almost always) and by value.
Then when doing basic operations like "group by" you end up with excessively elaborate indexes that are in my experience useless and always need to be manually squashed to something coherent.
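A small sketch of how the three access styles can disagree, assuming a frame whose integer index is not in positional order (the data is made up):

```python
import pandas as pd

# Integer labels deliberately out of positional order.
df = pd.DataFrame({"v": [10, 20, 30]}, index=[2, 1, 0])

column = df["v"]            # [] on a DataFrame selects a column
by_label = df.loc[0, "v"]   # .loc: the row whose index label is 0
by_position = df.iloc[0, 0]  # .iloc: the first row, whatever its label
```

Here by_label is 30 and by_position is 10, even though both were asked for "0".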
It's a common joke for me that whenever even a seasoned Pandas user cries out "gaarrr! why isn't this working!?" I just reply "have you tried reset_index?"... this works in a frighteningly large number of cases.
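For reference, this is the kind of situation where the reset_index joke applies. A minimal sketch with made-up data: grouping by two keys moves them out of the columns and into a MultiIndex, and reset_index (or as_index=False) brings them back.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "store": ["a", "a", "b", "b"],
        "year": [2020, 2021, 2020, 2021],
        "sales": [1, 2, 3, 4],
    }
)

# The grouping keys become a two-level index; only "sales" stays a column.
grouped = df.groupby(["store", "year"]).sum()

# Squash the index back into ordinary columns.
flat = grouped.reset_index()

# Or ask groupby not to build the index in the first place.
flat_direct = df.groupby(["store", "year"], as_index=False).sum()
```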
I don't mean to disparage pandas, which is a library that does a lot of things fairly well. But as an API for data manipulation I find it very verbose and it doesn't mesh with a "functional" way of thinking about applying transformations.
Generally, I've even preferred Spark to pandas, though it's hardly less verbose. Coming from R, it's much slower than data.table and nowhere near as slick and discoverable as dplyr. Its system of indices is a pain that I'd rather not deal with at all (and, indeed, I can't think of another data frame library that relies on them). I hate finding CSVs that other data scientists have created from pandas, because they invariably include the index ...
Handles time series really well, though.
Recently I've been using polars (https://github.com/pola-rs/polars). As an API I much, much prefer it to pandas, and it's a lot faster. Comes at the cost of not using numpy under the hood, so you can't just toss a polars data frame into a sklearn model.
That being said:
> I hate finding CSVs that other data scientists have created from pandas, because they invariably include the index ...
This is also default in R, with row numbers (like I have ever needed them). To be fair, it's gotten better since people stopped putting important information in rownames.
Polars looks interesting, thanks for the recommendation!
Ideally you should be using the parquet format, which is binary and preserves column types and indexes [df.to_parquet(<file>); df = pd.read_parquet(<file>)]
You can get away from a lot of problems by simply avoiding text files
I run into pandas edge cases all the time. pd.concat() failing on empty sequences (just let me specify a default for that case please); .squeeze() not letting me say, "squeeze down to a series but not a scalar"; .groupby().apply() returning different types depending on how many groups/rows per group there are... it's fine when you know exactly what you have but it's hard using it in a pipeline that needs to be agnostic about whether there's zero, one, or many data (datums?).
It reminds me of base R from 2010, and I thought dplyr had driven a stake through the heart of those approaches.
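The concat case at least is easy to paper over. A sketch of the kind of wrapper I mean, where concat_or_default is a hypothetical helper and not a pandas function:

```python
import pandas as pd


def concat_or_default(frames, default=None):
    """Concatenate frames; return `default` (or an empty frame) if there are none."""
    frames = list(frames)
    if not frames:
        return pd.DataFrame() if default is None else default
    return pd.concat(frames, ignore_index=True)


error = None
try:
    pd.concat([])  # raises ValueError: there is nothing to concatenate
except ValueError as exc:
    error = exc

empty = concat_or_default([])
combined = concat_or_default(
    [pd.DataFrame({"a": [1]}), pd.DataFrame({"a": [2, 3]})]
)
```

It doesn't fix the inconsistency, it just moves the special case into one place in the pipeline.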
More generally, the API is large, all-consuming and not consistent. sklearn is best in class here, I rarely need to look things up whereas the pandas docs autocomplete in my browser after one or two characters.
Matplotlib is my go-to despite being mediocre. I recently found the proplot library built on it, which seems to solve a lot of the warts (particularly around figure layout with subplots and legends). I haven't had a chance to use it yet - does anyone know if it's worth it?
I like to stick to basic, widely used tools when possible so I'm biased against it versus just wrangling it out with matplotlib. But proplot does look compelling, like it was written for exactly my complaints.
I'm surprised you don't like pandas. I've found it to be a pretty easy to use and useful tool, and you can almost always use something like Dask (or if you're lucky cuDF from rapidsai) if you need better performance.
I will say that my very first "real" programming experience was Matlab at a research internship, so maybe I just got used to working in vectors and arrays for computational tasks.
> I just got used to working in vectors and arrays for computational tasks.
Have you worked with R? R, like matlab, natively supports vector based operations. In fact, all values in R are vectors. Many of the problems with Pandas ultimately boil down to the fact that you have to replicate this experience without truly being in a vector based language.
If you're doing data science aren't sklearn, DL, and numpy getting you 90% of the way there anyway? Even if R has better "versions" of pandas/matplotlib (not conceding that point) it's not exactly central to the job of data science.
As a working data scientist I'd say it's completely the opposite: a good tabular data manipulation package is the single most valuable tool in my tool box. And R's packages (either data.table or dplyr) are definitely way better than pandas. There's no comparison.
I would be hard-pressed to find a working data scientist whose definition of data science is "that thing you do with sklearn, Deep Learning and Numpy".
Tabular data is great for many use cases, but saying that image, audio, and video analysis is not data science seems like a weird variant of gatekeeping to me.
> Tabular data is great for many use cases, but saying that image, audio, and video analysis is not data science seems like a weird variant of gatekeeping to me.
Most problems are mostly tabular, IME.
I completely agree that text, images and video are much, much better handled by Python (that's why I use and know both).
> "Data science is that thing where you do sklearn, Deep Learning and Numpy" is not a working data scientist's perspective.
It could be. It's such a broad job title and it looks so different across different companies and teams that the main tool for one data scientist might be something that another data scientist never has to touch. Different data science jobs prioritise different tools, that's all.
Right, so defining data science as 90% sklearn+DL+numpy is just as silly as saying that it's 90% table manipulation. That's exactly my point.
Still, if anyone here has managed to find a data science job in which tabular data management is not a sizable piece of what you do, I'd like to know some details!
I imagine there are data scientists who operate primarily on unstructured rather than tabular data. Part of my current job involves stuff like text classification, and it's not that difficult to imagine someone for whom that's a more sizable proportion of their day-to-day.
Still, my suspicion -- at least from my corner of data science -- is that such individuals are rare, and that most data scientists do make use of tabular data more often than not.
I totally get what you mean - I would suspect that when you work with unstructured data, tabular data manipulation is maybe 20-40% of what you do, and when you work with structured data, it's more like 60-80%.
I worked as a data scientist for a couple of years and tabular data was a very small part of my job. I spent far more time with image analysis and JSON, both of which I found R sucks at.
Maybe we are casualties of the vague definition of "data science," but in my experience numpy is too low-level for most of what I consider DS, and pandas/matplotlib are _much_ more central than sklearn or pytorch. Even if your definition only encompasses deep learning research, surely plotting is still indispensable?
I'll also add my vote for the superiority of data.table and ggplot2 to any Python alternatives. The bloat and verbosity of pandas is a daily struggle.
Just curious. In which way is data.table superior to pandas? Really interested about it! From my personal experience pandas is just sometimes a bit slow.
I'm more a dplyr man myself, but data.table is much faster than pandas, most noticeably IMO when reading large files. It's also extremely succinct if you're into that sort of thing (though I find it a bit obfuscated). pandas is a lot of things, but "fast" and "concise" are not two of them.
Got it. Regarding fast, you have something like Vaex on the Python side (but not sure how fast it really is). For me, I had the most issues with pandas using its multiindex.
> For me, I had the most issues with pandas using its multiindex.
Yessss. I loathe indices, and have never been in a situation where I was better off with them than without them.
> Regarding fast, you have something like Vaex on the Python side
I've never used Vaex, but I've used datatable (https://github.com/h2oai/datatable) and polars (https://github.com/pola-rs/polars). Polars is my favorite API, but datatable was faster at reading data (Polars was faster in execution). I'll have to give Vaex a try at some point.
Pandas is the PHP of data science. Pretty badly designed, but immensely popular because it got there first and had no real competition (in Python) for years.
> If you're doing data science aren't sklearn, DL, and numpy getting you 90% of the way there anyway?
Not really, tbh. Most of my jobs (even when the primary output was models) require spending a _lot_ of time data wrangling and plotting. R is much, much better for this kind of exploratory work.
But if I need to integrate with bigger systems (as I normally do), there's a stronger push for Python to reduce complexity and make it easier for SE's to understand and maintain (some of) the code.
Early on, pandas made some unfortunate design decisions that are still biting hard. For example, the choice of datetime (pandas.Timestamp) represented by a 64-bit int with a fixed nanosecond resolution. This choice gives a dynamic range of +-292 years around 1970-01-01 (the epoch). This range is too small to represent the works of William Shakespeare, never mind human history. Using pandas in these areas becomes a royal pain in the neck, for one constantly needs to work around pandas datetime limitations.
OTOH, in numpy one can choose time resolution units (anything from attosecond to a year), tailoring time resolution to your task (from high energy physics all the way to astronomy). Pandas' choice is only good for high-frequency stock traders, though.
The problem is not with Wes's original decision but with the fact that it was never revisited even when pandas took off at a much larger scope. It should have been fixed before the 1.0 release.
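The 292-year figure falls straight out of the arithmetic. A back-of-the-envelope sketch, assuming a signed 64-bit nanosecond counter centred on 1970 and an average Gregorian year of 365.2425 days:

```python
# Nanoseconds in an average Gregorian year (365.2425 days).
NANOSECONDS_PER_YEAR = 365.2425 * 24 * 60 * 60 * 10**9

# Largest value a signed 64-bit integer can hold.
max_int64 = 2**63 - 1

# How many years of nanoseconds fit on each side of the epoch.
span_years = max_int64 / NANOSECONDS_PER_YEAR

earliest_year = 1970 - span_years
latest_year = 1970 + span_years
```

span_years comes out a little above 292, which puts the representable window at roughly the late 1670s to the early 2260s.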
I'm glad you posted about this because I didn't know, but my reflexive response was 'well guess that won't work for [project idea], guess I'll roll my own or just use the NumPy version.'
I personally don't mind the lack of one-size-fits-all. If Pandas were to be part of the Python Standard Library I think you'd have a stronger argument, since the unspoken premise of a SL is that you can leave for a desert island with only that and your IDE and still get things done.
Most data is not 300 years old or in the distant future; in fact, ranges of 1970+-292 years are very common. That is to say, pandas' choice is good for lots of people, including outside high-frequency stock traders.
> Most data is not 300 years old or in the distant future; in fact, ranges of 1970+-292 years are very common.
In what domains? Astronomy, geology, history call for a larger time range. Laser and High Energy physics need femtosecond rather than nanosecond resolution. My point is that a fixed time resolution, whatever it is, is a bad choice. Numpy explicitly allows you to select the time resolution unit and this is the right approach. BTW, numpy is a pandas dependency and predates it by several years.
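A short sketch of what choosing the unit looks like in numpy. The dates and the 50 fs value are arbitrary illustrations, not data from any real project.

```python
import numpy as np

# Day resolution: dates well outside a 1970 +- 292 year window are fine.
start = np.datetime64("1600-01-01", "D")
end = np.datetime64("1601-01-01", "D")
days_between = (end - start).astype(int)

# Femtosecond resolution for very short durations.
pulse = np.timedelta64(50, "fs")
```

The unit is part of the dtype, so start is a datetime64[D] value and pulse is a timedelta64[fs] value.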
Best documented library. It even provides examples, guidance and best practices in the documentation. Have rarely learned so much as when I went through the scikit-learn documentation. Absolute delight.
Really, for any other ML library the best documentation is how-tos spread through the web, but scikit-learn leaves very little room for that kind of content.
Great that they finally added quantile regression. This was sorely missed.
I’m still hoping for a mixed-effects model implementation someday, like lme4 in R. The statsmodels implementation can only do predictions on fixed effects, which limits it greatly.
I’ve always wondered why mixed effect type models are not more popular in the ML world.
scikit-learn (next to numpy) is the one library I use in every single project at work. Every time I consider switching away from python I am faced with the fact that I'd lose access to this workhorse of a library.
Of course it's not all sunshine and rainbows - I had my fair share of rummaging through its internals - but its API design is a de-facto standard for a reason.
My only recurring gripe is that the serialization story (basically just pickling everything) is not optimal.
I recently ran into this issue as well. Serialization of sklearn random forests results in absolutely massive files. I had to switch to lightgbm, which is 100x faster to load from a save file and about 20x smaller.
There is so much wrong with the api design of sklearn (how can one think "predict_proba" is a good function name?). I can understand this, since most of it was probably written by PhD students without the time and expertise to come up with a proper api; many of them without a CS background.[1]
These seem like minor gripes (reading your link), and I don't even agree with them; it seems like an OK use of mutable state (otherwise a separate object would be needed for hyperparameter state?). Maybe my expectations are low, but the way sklearn unifies the API across different estimators all across the library is already way above what you can expect, especially if you consider it to be "written by a bunch of phd students".
I didn't want to bag on sklearn (I've already bagged on pandas enough here), but for what it's worth I agree with you. It's, ahh, not the API I would've come up with. It's what everybody has standardized on, though, and maybe there's some value in that.
NN as in "neural network", or NN as in "nearest neighbour" algorithm? No to the former, yes to the latter. The reason for a "no" to neural networks - in my case I've only ever implemented neural networks with many layers, and typically using kernels, pooling mechanisms, etc, and since scikit-learn doesn't have GPU support, I opt for frameworks that do (PyTorch, TensorFlow). However, if you're only building fully-connected neural nets (MLPs), with just a few layers, you don't need GPU support since any benefits of having parallel processing are offset by shuffling data between CPU and GPU. So in that case, scikit-learn would probably work quite well, although I never tested this myself.
GPU can be useful for Nearest Neighbour as well. In case you have access to a GPU, I would strongly recommend Facebook's FAISS [1,2]. For everything else, sklearn is amazing.
I have used the MLP classifier[1] before. It's very simple to use (like most of sklearn's models). Worked well for standard and reasonably small classification models, but lacks some features for it to be a flexible way of using NNs:
- No saving checkpoints (can be crucial for large models that need a lot of compute and time)
- No way to assign different activation functions to different layers
- No complex nodes like LSTM, GRU
- No way to implement complex architectures like transformers, encoders etc
I also do not know if it's even possible to use CUDA or any GPU with it.
AFAIU, there are no Yellowbrick visualizers for PyTorch or TensorFlow; though PyTorch and TensorFlow work with TensorBoard for visualizing CFG execution.
> Many machine learning libraries implement the scikit-learn `estimator API` to easily integrate alternative optimization or decision methods into a data science workflow. Because of this, it seems like it should be simple to drop in a non-scikit-learn estimator into a Yellowbrick visualizer, and in principle, it is. However, the reality is a bit more complicated.
> Yellowbrick visualizers often utilize more than just the method interface of estimators (e.g. `fit()` and `predict()`), relying on the learned attributes (object properties with a single underscore suffix, e.g. `coef_`). The issue is that when a third-party estimator does not expose these attributes, truly gnarly exceptions and tracebacks occur. Yellowbrick is meant to aid machine learning diagnostics reasoning, therefore instead of just allowing drop-in functionality that may cause confusion, we’ve created a wrapper functionality that is a bit kinder with it’s messaging.
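To illustrate the distinction that passage draws between the method interface and learned attributes, here is a toy sketch in plain Python. MeanRegressor and require_learned_attribute are hypothetical names made up for this example; this is not Yellowbrick's wrapper or any scikit-learn class, just the same convention in miniature.

```python
class MeanRegressor:
    """Toy estimator following the fit/predict convention."""

    def fit(self, X, y):
        # Learned state gets a trailing underscore, as in scikit-learn.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def require_learned_attribute(estimator, name):
    """Fetch a learned attribute, failing with a readable message if it is absent."""
    try:
        return getattr(estimator, name)
    except AttributeError:
        raise AttributeError(
            f"{type(estimator).__name__} has no learned attribute {name!r}; "
            "was it fitted, and does it expose this attribute at all?"
        ) from None


model = MeanRegressor().fit([[0], [1]], [2.0, 4.0])
predictions = model.predict([[5]])
learned = require_learned_attribute(model, "mean_")

message = None
try:
    # The method interface matches, but this attribute was never learned.
    require_learned_attribute(model, "coef_")
except AttributeError as exc:
    message = str(exc)
```

An estimator can satisfy fit() and predict() and still break any tool that reaches for coef_, which is the gap the quoted docs are describing.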
> cuML is a suite of fast, GPU-accelerated machine learning algorithms designed for data science and analytical tasks. Our API mirrors Sklearn’s, and we provide practitioners with the easy fit-predict-transform paradigm without ever having to program on a GPU.
> As data gets larger, algorithms running on a CPU become slow and cumbersome. RAPIDS provides users a streamlined approach where data is initially loaded in the GPU, and compute tasks can be performed on it directly.
CuML is not an NN library; but there are likely performance optimizations from CuDF and CuML that would accelerate performance of NNs as well.
Dask ML works with models with sklearn interfaces, XGBoost, LightGBM, PyTorch, and TensorFlow: https://ml.dask.org/ :
> Scikit-Learn API
> In all cases Dask-ML endeavors to provide a single unified interface around the familiar NumPy, Pandas, and Scikit-Learn APIs. Users familiar with Scikit-Learn should feel at home with Dask-ML.
dask-labextension for JupyterLab helps to visualize Dask ML CFGs which call predictors and classifiers with sklearn interfaces: https://github.com/dask/dask-labextension
> Dask-ML works with {scikit-learn, xgboost, tensorflow, TPOT,}. ETL is your responsibility. Loading things into parquet format affords a lot of flexibility in terms of (non-SQL) datastores or just efficiently packed files on disk that need to be paged into/over in RAM. (Edit)
>> Featuretools is a framework to perform automated feature engineering. It excels at transforming temporal and relational datasets into feature matrices for machine learning
> Creating a feature matrix from a very large dataset can be problematic if the underlying pandas dataframes that make up the entities cannot easily fit in memory. To help get around this issue, Featuretools supports creating Entity and EntitySet objects from Dask dataframes. A Dask EntitySet can then be passed to featuretools.dfs or featuretools.calculate_feature_matrix to create a feature matrix, which will be returned as a Dask dataframe. In addition to working on larger than memory datasets, this approach also allows users to take advantage of the parallel and distributed processing capabilities offered by Dask