
Ask HN: Data analysis workflow? - tucaz
What kind of workflow do you employ when designing a data flow or analyzing data?

Let me give a concrete example. For the past year, I have been selling stuff on the interwebs through two payment processors, one of them being PayPal.

The selling process was put together with a bunch of SaaS tools hooked together through webhooks and notifications.

Now I need to take proper control of that and build a real flow to handle sign-up, subscription, and payment.

Before doing that, I'm analyzing and trying to reconcile all transactions to make sure the books are OK and nothing went unseen. There lies the problem. I have data coming from different sources: databases, Excel files, CSV exports, and some JSON files.

At first, I dealt with it by putting all the data in CSV files and trying to make sense of them with code, running queries within the code.

As I found holes in the data, I had to dig up more data from different sources, and it became a pain to continue with code. I have now imported everything into Postgres and have been "debugging" with SQL.

As I advanced through the process, I had to write a lot of routines to collect and match data. I also have to keep all the data files around and organized, which is very hard because I'm all over the place trying to find where the problem is.

How do you handle this? What kind of workflow do you use? Any best practices or recommendations from people who do this for a living?
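A minimal sketch of the kind of reconciliation I'm doing (transaction ids and amounts below are made up, and real exports obviously have more columns):

```python
from decimal import Decimal

# Hypothetical rows exported from each source, keyed by transaction id.
paypal = {
    "T1": Decimal("10.00"),
    "T2": Decimal("25.00"),
    "T3": Decimal("5.50"),
}
other_processor = {
    "T1": Decimal("10.00"),
    "T2": Decimal("24.00"),  # amount mismatch
    "T4": Decimal("7.00"),   # missing from the PayPal export
}

# Transactions present in one source but not the other (set difference on ids).
only_paypal = sorted(paypal.keys() - other_processor.keys())
only_other = sorted(other_processor.keys() - paypal.keys())

# Transactions present in both sources but with differing amounts.
mismatched = sorted(
    tid for tid in paypal.keys() & other_processor.keys()
    if paypal[tid] != other_processor[tid]
)

print(only_paypal)  # ['T3']
print(only_other)   # ['T4']
print(mismatched)   # ['T2']
```

It works for a toy example, but keeping dozens of such ad-hoc scripts and their input files organized is exactly what I'm struggling with.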
======
westurner
Pachyderm may be basically what you're looking for. It does data version
control with/for language-agnostic pipelines that don't need to always redo
the ETL phase. [https://www.pachyderm.io](https://www.pachyderm.io)

Dask-ML works with {scikit-learn, xgboost, tensorflow, TPOT,}. ETL is your
responsibility. Loading things into Parquet format affords a lot of
flexibility: it works with (non-SQL) datastores, or simply as efficiently
packed files on disk that can be paged into RAM as needed.
[http://ml.dask.org/examples/scale-scikit-learn.html](http://ml.dask.org/examples/scale-scikit-learn.html)
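Even before reaching for Dask, plain pandas can stream a large export in fixed-size chunks instead of loading it into RAM at once. A minimal sketch with made-up data (an in-memory CSV stands in for a big file on disk):

```python
import io

import pandas as pd

# A small in-memory CSV standing in for a large export on disk.
csv_data = io.StringIO(
    "txn_id,amount\n"
    "T1,10.00\n"
    "T2,25.00\n"
    "T3,5.50\n"
    "T4,7.00\n"
)

# Stream the file in chunks of 2 rows instead of loading it whole,
# accumulating a running total and row count per chunk.
total = 0.0
rows = 0
for chunk in pd.read_csv(csv_data, chunksize=2):
    total += float(chunk["amount"].sum())
    rows += len(chunk)

print(rows, total)  # 4 47.5
```

The same loop works unchanged on a multi-gigabyte CSV path; only the chunks ever reside in memory.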

Sklearn.pipeline.Pipeline API: {fit(), transform(), predict(), score(),}
[https://scikit-learn.org/stable/modules/generated/sklearn.pi...](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html#sklearn.pipeline.Pipeline)
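A tiny Pipeline sketch (toy generated data; the step names "scale" and "clf" are arbitrary labels):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy dataset; in practice X would be your engineered features.
X, y = make_classification(n_samples=200, random_state=0)

# A Pipeline chains transform steps with a final estimator, so that
# fit()/predict()/score() apply the whole chain consistently.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
pipe.fit(X, y)
acc = pipe.score(X, y)
print(acc)  # training accuracy, well above chance on this toy data
```

The payoff is that the scaler is fit only on whatever data you call fit() with, so preprocessing can't silently leak test data into training.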

[https://docs.featuretools.com](https://docs.featuretools.com) can also
minimize ad-hoc boilerplate ETL / feature engineering:

> _Featuretools is a framework to perform automated feature engineering. It
> excels at transforming temporal and relational datasets into feature
> matrices for machine learning._

The PLoS 10 Simple Rules papers distill a number of best practices:

"Ten Simple Rules for Reproducible Computational Research"
[http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fj...](http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003285)

“Ten Simple Rules for Creating a Good Data Management Plan”
[http://journals.plos.org/ploscompbiol/article?id=10.1371/jou...](http://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004525)

In terms of the scientific method, beware of retrofitting a null hypothesis
like "there is no significant relation between the [independent and dependent]
variables" after repeatedly mining the same data: that shades into p-hacking
and data dredging, and can yield an overfit model that only seems to predict
or classify well because it was tuned against one fixed train/test split (e.g.
one made with sklearn.model_selection.train_test_split and a given random
seed).
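The split caveat is easy to see concretely: with a fixed seed the split is reproducible, and an unconstrained model can ace the training set while the held-out score exposes the gap. A sketch on toy data (all settings here are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy dataset with a few informative features plus noise.
X, y = make_classification(n_samples=300, n_informative=3, random_state=0)

# A fixed random_state makes the split reproducible run to run.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# An unconstrained decision tree memorizes the training set...
tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
train_acc = tree.score(X_tr, y_tr)
test_acc = tree.score(X_te, y_te)
print(train_acc)  # 1.0 — perfect on the data it memorized
# ...while the held-out score is what actually matters.
print(test_acc)
```

Cross-validation over several splits (rather than tuning against one fixed seed) gives a less flattering, more honest estimate.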

One of these days (in the happy new year!) I'll get around to updating these
notes with the aforementioned tools and docs:
[https://wrdrd.github.io/docs/consulting/data-science#scienti...](https://wrdrd.github.io/docs/consulting/data-science#scientific-method)

IDK what [https://kaggle.com/learn](https://kaggle.com/learn) offers
specifically in terms of analysis workflow, but their Docker containers have a
great many tools configured in a reproducible way:
[https://github.com/Kaggle/docker-python/blob/master/Dockerfi...](https://github.com/Kaggle/docker-python/blob/master/Dockerfile)

