
Another Book on Data Science – Learn R and Python in Parallel - zelda_1
https://www.anotherbookondatascience.com/
======
clarityPhone
I skimmed through the book, and think it does a very poor job of showcasing
how R and Python are juxtaposed in industry.

To be fair, the book advertises showing R and Python code side-by-side, and
that’s what it does. But it does so in a way that doesn’t match how the
languages are most often used in industry.

As a quick example, I saw no tidyverse code, which is essentially the only
thing keeping R in the game. Learning R from this book won’t prepare you for
writing R in most R shops.

I don’t see the utility in knowing how to do the same thing in both Python and
R if you’re a beginner. This is even more true if you’re not taking advantage
of the strengths/weaknesses of either language.

Instead, just learn one of the languages well, and then learn the other well.
Shallow dives in both will make you weak in both.

Unfortunately, 90% of data science content seems to be geared at beginners.

~~~
wjn0
I agree for the most part, but R does have a few things beyond the tidyverse:
built-in dataframe support, lots of domain-specific packages, more consistent
interfaces for basic statistics and machine learning models, etc. Python is
definitely better for matrices (because of NumPy) and anything involving
custom gradient descent methods (because of TensorFlow).
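
For readers unsure what a "custom gradient descent method" involves, here is a
minimal sketch in plain Python that fits a line by least squares. The data,
learning rate, and the name `fit_line` are illustrative assumptions. TensorFlow
would derive the gradients automatically; here they are written out by hand.

```python
def fit_line(xs, ys, lr=0.05, steps=2000):
    """Fit y ~ w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current model on every training point.
        errors = [w * x + b - y for x, y in zip(xs, ys)]
        # Hand-derived gradients of mean((w*x + b - y)^2) w.r.t. w and b.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = 2.0 * sum(errors) / n
        # Step against the gradient.
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Assumed toy data lying exactly on the line y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [2.0 * x + 1.0 for x in xs]
w, b = fit_line(xs, ys)
```

With NumPy the list comprehensions would collapse into vectorized array
expressions, which is the convenience the comment is pointing at.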

I think 90% of data science content is for beginners because anything more
advanced isn't best described as data science. As soon as you get beyond the
initial stages of data analysis (cleaning and processing data), you're doing
something best described as some other word (statistics, machine learning,
etc.) - although, granted, there isn't much content in these areas if you
don't know _exactly_ what you're looking for.

~~~
randomvectors
> built-in dataframe support

Not an advantage if you ask me - exactly because data.frame is built in,
people have been building their own versions (tibble, data.table) instead of
improving it. That's how R ended up with 3 different structures that are
similar but have inconsistent APIs and behaviour.

> lots of domain-specific packages

That's true.

> more consistent interfaces for basic statistics and machine learning models

Can't disagree more - there is no one go-to library for ML in R (like sklearn
in Python) and each package has its own strange interface and implementation.
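
To make the "one go-to library" point concrete: sklearn estimators share the
same `fit`/`predict` convention, so models can be swapped without touching the
surrounding code. The sketch below imitates that convention in plain Python.
`MeanRegressor`, `NearestNeighborRegressor`, and `mean_abs_error` are
hypothetical toy names for illustration, not sklearn classes.

```python
class MeanRegressor:
    """Toy model: always predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


class NearestNeighborRegressor:
    """Toy model: predicts the target of the closest training point (1-D inputs)."""

    def fit(self, X, y):
        self.X_, self.y_ = list(X), list(y)
        return self

    def predict(self, X):
        preds = []
        for x in X:
            # Index of the training input closest to x.
            idx = min(range(len(self.X_)), key=lambda i: abs(self.X_[i] - x))
            preds.append(self.y_[idx])
        return preds


def mean_abs_error(model, X_train, y_train, X_test, y_test):
    """Works with any object that exposes fit(X, y) and predict(X)."""
    preds = model.fit(X_train, y_train).predict(X_test)
    return sum(abs(p - t) for p, t in zip(preds, y_test)) / len(y_test)


# Assumed toy data; the evaluation code is identical for both models.
X_train, y_train = [1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]
X_test, y_test = [1.1, 3.9], [11.0, 39.0]
scores = {
    type(m).__name__: mean_abs_error(m, X_train, y_train, X_test, y_test)
    for m in (MeanRegressor(), NearestNeighborRegressor())
}
```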

~~~
RosanaAnaDana
You mean like keras? or tensorflow? Or base random forest. You know, like the
original Breiman implementation.

Python has utility. But R is _far_ superior in the quality of its packages,
their documentation, and their ability to behave predictably on a given data
type.

I run a machine learning shop. Right now all of the training, application, and
data management is handled via R. R is simply superior in too many ways for us
to be bothered with python for the scale of work we are doing.

Since we're moving some big applications to keras/ TF we do use python and
will be using more in the future. However, for almost all data management,
munging, movement, visualization, and reporting, it's an R world.

~~~
randomvectors
> You mean like keras? or tensorflow? Or base random forest. You know, like
> the original Breiman implementation.

> ...

> Since we're moving some big applications to keras/ TF we do use python and
> will be using more in the future.

Not sure if I misunderstood, or you're contradicting yourself there.

> R is far superior in the quality of its packages, their documentation,
> their ability to behave predictably on a given data type.

I not only disagree but I think that the exact opposite is true for each one
of these points. But if things are working well in your shop, I'm not going to
try to convince you otherwise.

~~~
RosanaAnaDana
My point behind the keras/TF comment is that the libraries have front ends in
both Python and R, so it's mix and match/dealer's choice on what you like to
work in (since the backends of both are identical).

The primary reason for moving these to Python is convenience and the
community. Most new work is published in Python. If we find a new/interesting
model we want to implement, it's probably written in Python. Rather than reskin
the thing in its entirety, it's easier here to work in Python.

A couple disclaimers: my group works primarily in geospatial data, and
principally in LiDAR and multispectral imagery.

The coarse division I see between R and Python is that if you come from a
research/academic background (non-engineering), you probably learned to
program in R. If you were an engineer, you probably learned MATLAB. If you are
self-taught/Coursera/YouTube, you probably learned in Python.

R libraries are generally more geared towards academic research, and
specifically, working within existing frameworks (handling geospatial data as
geospatial data rather than turning it into NumPy arrays). Working in
Python, there is far more re-invention of the wheel, and it's always a pain in
the ass to get things back into the structures they came in as.

Python has huge utility and is an important tool for certain work. But it's
really, really not faster than R (it definitely used to be; this isn't the case
any more).

R has better support for more scientific programming than python.

~~~
euler_angles
> My point behind the keras/TF comment is that the libraries have front ends
> in both Python and R, so it's mix and match/dealer's choice on what you like
> to work in (since the backends of both are identical).

Not as a point of argument, just additional information: R's support for keras
and TF is a wrapper around the Python interface to those libraries.

------
randomvectors
This just doesn't seem to have a place.

1\. It's aimed at beginners.

2\. If you're a beginner, you're best off picking one language and sticking
with it for a while.

3\. There are so many other beginner resources that are much better.

~~~
thewhitetulip
Can you please list a few?

I'm currently following ISLR book & course

~~~
randomvectors
Depends on what you're interested in, your goals and your starting point. You
have two main learning lanes:

1\. Theory.

Math, calculus, linear algebra, probability, statistics, ML algorithms. ISLR
is a very good beginner-ish resource that helps you understand the algorithms
but doesn't go too deep into the math. As you go deeper, you may realise that
you have gaps in your math knowledge and you need to cover a lot more
probability, calculus and linear algebra.

2\. Programming.

2.1. Good software engineering practices - writing maintainable code, design
patterns, version control, unit testing etc.

2.2. Tooling - knowing the language-specific ecosystem of libraries (the OP is
an attempt to teach you this in two languages at the same time). This is what
most beginner resources focus on; your knowledge here has the least
transferability and tends to go out of date quickly. Still, using the right
tools and knowing them well goes a long way.

~~~
thewhitetulip
I'm sorry in advance if I'm taking too much of your time.

#2: coding is my hobby and I have been writing well-designed apps for a long
time, so that's not an issue

#3: ISLR teaches how to do ML algorithms in R, so that covers that point

#1 is what I'd like more information on. How important is the maths to work as
a data scientist? I'm planning ISLR and then maybe ESL or some advanced course

A few of my friends work in data science, and they said that maths isn't that
important: one needs to know the formulas and why things work the way they do,
rather than rote-learning the math.

It'd be great if you could list some intermediate courses. The learning market
has drowned out good tutorials and books with not-so-good guides!

~~~
randomvectors
> then maybe ESL or some advanced course

> A few of my friends work in data science, and they said that maths isn't that
> important

Without the math you won't understand anything in ESL.

Which might be okay if the job doesn't require you to go into that much depth
- some data science jobs are more focused on research (very math-heavy), some
on ETL and/or engineering, others on business understanding and communication;
it's a really broad title.

~~~
thewhitetulip
Thank you. That's what I was thinking. The jobs I'll apply for won't be
math-oriented.

------
anonu
What does the HN community think about Python for statistical computing?

Core Python is fine. But pandas is an atrocious mess of object orientedness
and other weird stuff.

------
omnimkar69
I also skimmed through the book, and it does a very poor job in the case of
Python. I think they have to change this concept.

------
thatcat
I'd be interested in a version that included SAS. Is SAS ever used outside of
academia?

~~~
randomvectors
SAS is still used by companies who don't want to use open source tools and
would rather pay a lot of money for an established product name and have a
support line that they can call if anything goes wrong. Understandably, it's
slowly dying out.

~~~
tomrod
Support, and also indemnification.

SAS lives on borrowed time.

------
Dkastro92
I think it's a good first impression of either language

------
dlphn___xyz
i only read the chapter on optimization/linear programming - it's far too brief
and misses key concepts for it to be useful. there are no real applications of
the concepts covered in the section.

~~~
zelda_1
Yes, it is very brief. I tried to give some useful reference
books/papers/links for each subject that appears in the book. I hope that is
useful if readers are willing to dive deeper. But the concept of linear
programming itself is not complex, and users are generally not required to
understand the simplex or interior-point algorithms. If they are able to
translate the actual problem into code, then it is basically done. It is not
like constrained optimization, for which customized treatment based on
mathematical theory is more important than writing the code.
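
As an illustration of what "translating the actual problem into code" can look
like, here is a hedged sketch for a small textbook-style problem: maximize
3x + 5y subject to x <= 4, 2y <= 12, 3x + 2y <= 18, and x, y >= 0. The numbers
and the name `solve_2d_lp` are assumptions for illustration, not taken from the
book. In practice the coefficient lists would be handed to an LP solver; to
stay dependency-free, this sketch uses brute-force vertex enumeration, which
only suits tiny two-variable problems with a bounded feasible region and is
neither the simplex nor an interior-point method.

```python
from itertools import combinations

# Each constraint is (a, b, rhs), meaning a*x + b*y <= rhs.
constraints = [
    (1.0, 0.0, 4.0),    # x <= 4
    (0.0, 2.0, 12.0),   # 2y <= 12
    (3.0, 2.0, 18.0),   # 3x + 2y <= 18
    (-1.0, 0.0, 0.0),   # x >= 0, written as -x <= 0
    (0.0, -1.0, 0.0),   # y >= 0, written as -y <= 0
]
objective = (3.0, 5.0)  # maximize 3x + 5y


def solve_2d_lp(objective, constraints, tol=1e-9):
    """Return (best_point, best_value) by checking every feasible vertex."""
    best_point, best_value = None, float("-inf")
    for (a1, b1, r1), (a2, b2, r2) in combinations(constraints, 2):
        det = a1 * b2 - a2 * b1
        if abs(det) < tol:
            continue  # parallel boundaries never intersect in a single point
        # Intersection of the two boundary lines (Cramer's rule).
        x = (r1 * b2 - r2 * b1) / det
        y = (a1 * r2 - a2 * r1) / det
        # Keep the point only if it satisfies every constraint.
        if all(a * x + b * y <= r + tol for a, b, r in constraints):
            value = objective[0] * x + objective[1] * y
            if value > best_value:
                best_point, best_value = (x, y), value
    return best_point, best_value


point, value = solve_2d_lp(objective, constraints)
```

The modelling work is entirely in the `constraints` and `objective` lists;
swapping the solver for a library routine would leave them unchanged.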

------
euler_angles
What are the good intermediate level data science books?

------
adamnemecek
Julia is hands down much better than either of these languages. Don't waste
your time.

~~~
mruts
I wholeheartedly agree. Python is garbage for data science. If an
industrial-grade NN library were written for Julia, plus some quant libraries,
I think most people would switch.

I work in finance doing data science-y things and have yet to meet anyone who
doesn’t think that Python is a pile of garbage.

People used to make the easy to learn argument, but Julia is even easier. And
more elegant, extensible, and faster.

~~~
huac
You, uh, don't like PyTorch and TensorFlow? I can't tell if this is sarcastic.

~~~
pjmlp
They are written in C++.

~~~
randomvectors
Not sure how that's relevant. Everything is written in something else.

Python itself is written in C. The Julia github repo shows Julia 68.2%, C
16.3%, C++ 10.4%, Scheme 3.2%. R is a mix of C, C++, R and some Fortran I
think...

~~~
pjmlp
It is relevant from the point of view of what a developer is able to achieve
without being forced to drop down to a 2nd programming language, aka "2
language syndrome".

And how many Python libraries are just plain wrappers, not really written in
Python.

I use TensorFlow from .NET ML and C++ API.

------
omnimkar69
I also skimmed through the book, and it does a very poor job in the case of
Python. I think they should change R because 90% of readers are beginners.

