
Python or Clojure for Data Analysis? - s_c_r
I’m starting a new role that will be heavy on data analysis. In the past I’ve focused on web dev/CRUD apps using Go, JS, PHP, other web stuff. All that to say, this will be new territory for me. Conventional wisdom seems to be that Python is the tool of choice for data science type roles, along with pandas/numpy and Jupyter notebooks. That or maybe R or Julia. On the other hand I hear about how Clojure is great for data munging type tasks. I have no experience with Lisps but I’m interested in dipping my toe in the water if it will be a good fit for what I’m going to be doing. Which one in your experience will pay off more in the long run? Python with its tooling and community or Clojure’s language design?
======
daslu
Best wishes for the new role!

Eventually, it may be a good idea to try both Clojure and Python.

Personally I find Clojure's approach towards data very refreshing. It does
require an open mind and a mindset different than usual. Eventually, this can
bring joy, simplicity and power.

This article by Chris Nuernberger nicely explains what it is about:
[https://cljdoc.org/d/cnuernber/libpython-clj/1.2/doc/so-many...](https://cljdoc.org/d/cnuernber/libpython-clj/1.2/doc/so-many-parenthesis-)

Clojure's community is certainly smaller than Python's, but some say it is
very friendly.

Below are some beginner-friendly places to chat about it. If you wish, let us
chat there, dive into the details, and think how you could begin exploring.

Clojurians Zulip
[https://clojurians.zulipchat.com](https://clojurians.zulipchat.com) and
especially the data-science stream:
[https://clojurians.zulipchat.com/#narrow/stream/151924-data-...](https://clojurians.zulipchat.com/#narrow/stream/151924-data-science)

Clojureverse [https://clojureverse.org](https://clojureverse.org)

~~~
s_c_r
Thanks for the links! I like the idea of a changed mindset. That sort of thing
can be useful in and of itself as an educational exercise.

~~~
daslu
By the way, we are having a Clojure data science public meeting at the end of
this month:
[https://twitter.com/scicloj/status/1291845872884625408](https://twitter.com/scicloj/status/1291845872884625408)

We will assume some basic knowledge of Clojure, but I guess it may be
interesting to join anyway.

~~~
s_c_r
Interesting! Appreciate the info

------
Jugurtha
Hi,

Congratulations on your new role. Are you joining a team, or are you the team?
If you're joining a team, then you'll probably use what they're using and
learn their tooling before you could endeavor to improve it.

You're doing it in a professional context, so it will be Python. Many blog
posts and articles on popular medium websites address shiny new things, but
most of these posts address one of a few scenarios: portfolio/toy projects, a
project with one individual working on it, a project with data that fits on
disk _and_ in RAM, and/or a Kaggle project where a good part of the heavy
lifting has been done for you (data acquisition, cleaning, feature
engineering, metric identification), which never happens in real life because
that's what you're hired for in the first place.

A big problem in this field is the fragmented tooling and experience, which
means you have to weave tools together, unless the team you're joining has it
figured out and has its internal tooling dialed in. Python dominates. I'm sure
other languages are used at other ML shops (we have used Scala in some of our
projects) but I think in your situation, there's no need to complicate things.

Then again, that is just an opinion. It is not the _right_ answer. The goal is
to deliver value.

All the best,

~~~
s_c_r
Thanks for the insight! I am the team--everything before was done quite
manually with spreadsheets. I already work in the organization but have been
promoted out of development into this new role so I'm blessed to have a large
amount of flexibility. It does seem like Python is the right tool for the job.
Your perspective confirms my hunch.

~~~
s1t5
> I am the team--everything before was done quite manually with spreadsheets.

In that case I would pick between Python and R. R might even win out slightly
over Python for your use case. Definitely not Clojure, Scala or even Julia.

------
nikonyrh
I have used both professionally in a senior data scientist role so I feel like
pitching in. Perhaps due to my background coming from Matlab I never got too
keen on dataframes (be it Pandas or whatever Clojure has to offer). Instead I
use matrices for homogeneous data, or whatever hashmap-of-list-of-sets
describes more complex data. When your data is already in a CSV format and you
want to do basic analysis on that or fit mathematical models, I highly
recommend the Python / Numpy / Pandas / Scipy combination. It can be easily
extended in whichever direction you want to go, be it PySpark or Keras.

Clojure taught me a lot about infinite lazy sequences (kinda like Python's
generators) and how to model the program as a pipeline. A good analogy is
found in shell programming. There you have stand-alone programs which handle
individual tasks, and you can pipe one program's stdout into the next
program's stdin. In Clojure you'd write stand-alone functions which you
"pipe" together via the "->" thread-first and "->>" thread-last macros. It
also ships with several handy functions such as "frequencies", "group-by" and
"partition-by". I have ported these and several others to my own Python
projects thanks to their versatility and a kind of universality.
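
To make the pipeline idea concrete, here is a rough sketch of what such ports might look like in plain Python. These are not nikonyrh's actual ports: the names `frequencies`, `group_by`, `partition_by` and `pipe` are hypothetical stand-ins that only approximate the Clojure functions and the "->>" macro, and the word list is made-up sample data.

```python
from collections import Counter, defaultdict
from functools import reduce
from itertools import count, groupby, islice


def frequencies(items):
    """Count how often each distinct item occurs (cf. Clojure's frequencies)."""
    return dict(Counter(items))


def group_by(key_fn, items):
    """Map each key to the list of items that produce it (cf. group-by)."""
    groups = defaultdict(list)
    for item in items:
        groups[key_fn(item)].append(item)
    return dict(groups)


def partition_by(key_fn, items):
    """Split items into consecutive runs that share a key (cf. partition-by)."""
    return [list(run) for _, run in groupby(items, key=key_fn)]


def pipe(value, *steps):
    """Feed value through each step in order, loosely like the ->> macro."""
    return reduce(lambda acc, step: step(acc), steps, value)


# Hypothetical sample data, only for illustration.
words = ["apple", "avocado", "banana", "blueberry", "apple"]

word_counts = frequencies(words)
by_initial = group_by(lambda w: w[0], words)
runs = partition_by(lambda w: w[0], words)

# A lazy pipeline over an infinite sequence: nothing is computed until
# the final list() step pulls the first five squares through.
first_squares = pipe(
    count(1),                          # 1, 2, 3, ... without end
    lambda xs: (x * x for x in xs),    # still lazy
    lambda xs: islice(xs, 5),          # keep only the first five
    list,
)
```

Here `runs` keeps the trailing "apple" in its own group, because partitioning only looks at consecutive items, while `by_initial` collects all three "a" words under one key. Each step is an ordinary function, so the pipeline can be rearranged much like a shell command line.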

Oh and speaking of macros, if you want to get fancy you can design your own
domain-specific language and express your problem in that, hiding all of the
boilerplate under the hood. But to get the highest performance sometimes you
need to think whether to use Clojure's immutable data structures or resort to
Java's mutable ones, which could have better performance (or use a library I
guess). Well at least on the JVM you can do "real" parallel programming,
unlike on the CPython interpreter due to the GIL.
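
For context on the GIL remark: the usual CPython workaround for CPU-bound work is to use several processes instead of threads, since each process has its own interpreter and its own GIL. Below is a minimal sketch using the standard library's `concurrent.futures.ProcessPoolExecutor`. The prime-counting task and the helper names `count_primes`, `split_range` and `parallel_count_primes` are made up for illustration, and the worker count of 4 is an arbitrary assumption.

```python
from concurrent.futures import ProcessPoolExecutor


def count_primes(start, stop):
    """Count primes in [start, stop) by trial division (deliberately CPU-bound)."""
    total = 0
    for n in range(max(start, 2), stop):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            total += 1
    return total


def split_range(stop, parts):
    """Split [0, stop) into at most `parts` contiguous (start, stop) chunks."""
    step = max(1, -(-stop // parts))  # ceiling division, at least 1
    return [(lo, min(lo + step, stop)) for lo in range(0, stop, step)]


def parallel_count_primes(stop, workers=4):
    """Count primes below `stop`, handing one chunk to each worker process."""
    chunks = split_range(stop, workers)
    # Each worker is a separate interpreter process with its own GIL,
    # so the chunks can run on different cores at the same time.
    with ProcessPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(count_primes, lo, hi) for lo, hi in chunks]
        return sum(f.result() for f in futures)


if __name__ == "__main__":
    # The guard matters: worker processes may re-import this module.
    print(parallel_count_primes(50_000))
```

The trade-off compared to JVM threads is that arguments and results are pickled and copied between processes, so this pays off mainly when each chunk does a lot of computation relative to the data it moves.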

Clojure is fun and very educational for all kinds of projects, but in a
professional data analysis setting I'd start with Python and if it seems like
a bad fit then do a PoC with Clojure. :)

What a huge topic.

------
whalesalad
If you want to learn something new and have cycles to burn on that: Clojure.
It’s a great language, but learning it is going to be a slower and more scenic
route.

If you want to get things done: Python. You’ll have no problem getting up to
speed based on your past experience, and the ecosystem is orders of magnitude
larger than Clojure’s.

------
auganov
As for Jupyter notebooks, REPL-driven development in Clojure gives you the
same ease (arguably better) of messing around with code while also scaling to
serious software dev. Though it's not as nice for sharing with others.

If you're working in an environment where there's a lot of collaboration
Clojure might be tough. But if you're actually going to be developing software
that relies on data analysis (rather than just doing it as a one off) I think
Clojure might be worth considering.

~~~
daslu
REPL-driven development is wonderful.

Clojure does have notebook solutions which are worth looking into:

* [https://github.com/clojupyter/clojupyter](https://github.com/clojupyter/clojupyter)
* [https://github.com/jsa-aerial/saite](https://github.com/jsa-aerial/saite)
* [https://pink-gorilla.github.io](https://pink-gorilla.github.io) (WIP, but is
growing fast and going to be magnificent)

Some other projects are trying to connect the notebook idea with the
REPL+editor experience. For example:

* [https://github.com/metasoarous/oz](https://github.com/metasoarous/oz)

------
dfah
Data analysis is alas a big field. I would say you should assess a few
factors:

1) How much basic-ish learning in the data science / statistics / ML areas you
expect to be undergoing yourself; the Python ecosystem will probably make this
much faster.

2) How you expect to scale & productionize your analysis tasks (if at all); in
my experience Python is a second-class citizen in the Spark world, not far
above Clojure's third-class status, and throughput gains from Clojure's native
JVM output may outweigh the relative convenience of the PySpark interface.
TensorFlow & TFX's interfaces are basically designed from the ground-up for
Python.

3) Which major techniques & corresponding libraries you expect to use (e.g.
MCMC/STAN, Pyro, TensorFlow, scipy, scikit-learn). Some of these might rule
out one language (more likely eliminating Clojure) or the other.

4) How important data visualization will be for you. This aspect of the work
will be much easier & richer in Python than in Clojure.

5) What kind of data transformation & validation you expect to do. If this is
largely statistical in nature (e.g. rescaling distributions) it's probably a
wash. If this is viz-heavy it'd favor Python. If this involves complicated
structured data, I'd recommend Clojure.

------
aynyc
Why Clojure? I honestly never heard of anyone using Clojure as data tools.

I use Python and Scala. I use Python for mostly small tasks. When I hit large
data, I normally use Spark on EMR (PySpark or Scala).

~~~
hellofunk
> I honestly never heard of anyone using Clojure as data tools.

Clojure is one of the most actively-used data analysis languages actually. It
is used in many industries for that purpose. Heck even the widely-used
Metabase is written in Clojure.

~~~
tcbasche
I don’t believe that. What do you have to back that up? I would have thought
Python or Scala - been in the data analysis game for a few years now and have
never heard Clojure mentioned once except on Hacker News.

------
aprdm
I would 100% go with Python. It has much more tooling and a much bigger
community, which makes it easier to ask questions and to collaborate with
people.

Once you're comfortable with it, then it might be worth exploring other, less
well-known languages that have a (subjectively) better design.

