

Ask HN: How to become a low-end big data consultant? - noobplusplus

I want to earn some money for a living.

Big Data is booming, and I want to strike while the iron is hot. I just want to make some money doing low-end consultancy.

I have 2 years of professional Python development experience, and I freelance with Django/Flask.

Big Data will fetch more $s per hour, so I want to shift to it. Could someone please give me pointers on where to start, so that at the end of, say, 2 months I can start doing some low-end freelancing on big data, plucking some low-hanging fruit?

I am not looking for shortcuts, but I do not want to be a researcher either. My goals are simple: learn what will bring in revenue.

Let me know what to learn and how much, and I shall put in the effort.

It would be good of people to tell me the areas I should focus on. Maybe I could start with a problem itself and learn whatever it takes to solve it, thus acquiring the skill set and using it to charge clients.

Also, where would I find clients?

Please let me know, thanks.
======
uberalex
I would suggest the following topics (forgive me if this is a bit
disorganised)

Languages:

    R, Python (pandas, numpy), C++ or Java, Matlab/Octave


Stats & Machine Learning Topics:

    Neural Nets, Decision Trees, SVM
    Regression (Linear, Nonlinear, Logistic)
    Clustering (K-means, Fuzzy C-means, Mixture Modelling, etc.)
    Time-series Modelling/Prediction (AR, ARMA, ARIMA, Exponential Smoothing)
    Bayesian Techniques
    Statistical Model Development


System experience:

    Linux administration
    Big Data Libraries (Hadoop at a minimum; Mahout, Hive, Storm, YARN as well)
    Data cleaning and large-scale data management

------
rpedela
You already know Python, which is great. It has become the dominant language
among data scientists. Most of the best data science tools are in Python or R,
and R can be easily integrated into Python.

The best way to start, in my opinion, is to take random public data sets and
convert them to a useful form using Python. For example, take an HTML page
with a spreadsheet on it.

1. Extract the spreadsheet from the HTML page.
2. Convert that spreadsheet to a CSV format.
2a. Reorganize the columns (optional).
2b. Clean the data: fix misspellings, reformat data such as an address to a
    standard format, etc.
3. Import the CSV file into a relational database.
4. The data set is now ready for use.

You could do 2a and 2b as 3a and 3b instead if that is easier for you. The
above is what I would consider "low-end". If you cannot do the above, then
statistics, R, Tableau, Hadoop, etc. will not help you much.
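
For anyone who wants to try it, the numbered steps above can be sketched with
nothing but the Python standard library. This is only a minimal illustration,
not how Datalanche does it: the sample HTML, the column names, the `hospitals`
table, and the title-casing cleanup rule are all made up for the example, and
SQLite stands in for whatever relational database you actually use.

```python
import csv
import io
import sqlite3
from html.parser import HTMLParser

# Made-up sample page; a real one would be downloaded or read from disk.
SAMPLE_HTML = """
<html><body><table>
<tr><th>name</th><th>city</th><th>beds</th></tr>
<tr><td>  st. mary  hospital</td><td>denver </td><td>120</td></tr>
<tr><td>General Clinic</td><td>BOULDER</td><td> 45 </td></tr>
</table></body></html>
"""


class TableExtractor(HTMLParser):
    """Step 1: collect the cell text of every <tr> as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text chunks of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def clean_row(row):
    """Step 2b: collapse whitespace and normalise case (hypothetical rules)."""
    name, city, beds = row
    return (" ".join(name.split()).title(), " ".join(city.split()).title(), int(beds))


def rows_to_csv(rows):
    """Step 2: serialise the rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf).writerows(rows)
    return buf.getvalue()


def load_csv_into_sqlite(csv_text, conn):
    """Step 3: import the CSV into a relational table."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader)  # skip the header row
    conn.execute("CREATE TABLE hospitals (name TEXT, city TEXT, beds INTEGER)")
    conn.executemany("INSERT INTO hospitals VALUES (?, ?, ?)", reader)
    conn.commit()


def run_pipeline(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    header, body = parser.rows[0], parser.rows[1:]
    csv_text = rows_to_csv([header] + [clean_row(r) for r in body])
    conn = sqlite3.connect(":memory:")
    load_csv_into_sqlite(csv_text, conn)
    return conn  # Step 4: the data set is ready to query
```

Calling `run_pipeline(SAMPLE_HTML)` gives you a connection you can run
`SELECT` queries against. Real pages are messier (nested tables, merged cells,
footnotes), which is exactly where the time goes.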

It turns out I am looking for people to do this. You can check out my company
([https://www.datalanche.com](https://www.datalanche.com)) and message me
through the feedback form if you are interested. The job is to do the above
for several public healthcare data sets.

~~~
lifeisstillgood
That sounds almost embarrassingly low end.

I have heard plenty about big data, but according to you ETL still seems to be
the biggest problem (closely followed by unsafe assumptions).

~~~
rpedela
I am curious, what are my "unsafe assumptions"? I would love more detailed
feedback.

ETL is not the biggest problem, but it is very time consuming, tedious, and
costly because the tools overall suck. That is why I started Datalanche,
although to be fair we have so far solved only part of the problem (storage,
query, sharing).

Anyway the OP wants to learn how to do big data and make money. Well even if
you are doing the cool stuff such as analytics, you still need to know ETL
basics. I offered to give some experience doing that.

~~~
lifeisstillgood
Not your unsafe assumptions. In stats, almost every blind alley and wrong
turn is an assumption that bit back: for example, assuming that the data we
have is a representative sample of all purchasers, etc.

I have no real knowledge of the big data market so am happy to defer. I assume
you are taking data sets covering the same thing in hundreds or thousands
of hospitals nationwide and outputting nicely mapped nationwide data?

Please do give as much info on the market as you can - always interested in
hearing from on-the-ground "troops".

~~~
rpedela
Just to give you a more concrete example.

The CDC has made available two data sets of de-identified medical records
that go back 20 years and are updated yearly. Unfortunately, it is
extremely difficult to find the data. In fact, I went to the correct web page
and missed it. One of my colleagues found it instead. The data is in a custom
text format which needs a custom parser, and the format changes slightly every
year. Codes and abbreviations are used everywhere in the data set, and they
need to be identified and converted. Then you have the usual misspellings,
etc. that are common in government data sets.

It is truly a pain to use these CDC data sets even though the data itself is
great.
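
To show what "custom parser" plus "format changes every year" plus "codes
everywhere" tends to look like in practice, here is a rough sketch. I am not
describing the real CDC layout: the fixed-width columns, the years, the field
names, and the sex codes below are all invented for illustration, and the
custom format is only assumed to be fixed-width.

```python
# Hypothetical per-year layouts: field name -> (start, end) column slice.
# Keeping one layout per year isolates the "format changes slightly" problem.
LAYOUTS = {
    2011: {"age": (0, 3), "sex": (3, 4), "diagnosis": (4, 9)},
    2012: {"sex": (0, 1), "age": (1, 4), "diagnosis": (4, 9)},
}

# Hypothetical code table; real data sets ship these in separate documentation.
SEX_CODES = {"1": "female", "2": "male"}


def parse_record(line, year):
    """Slice one fixed-width line using that year's layout and decode codes."""
    layout = LAYOUTS[year]
    raw = {name: line[start:end].strip() for name, (start, end) in layout.items()}
    return {
        "age": int(raw["age"]),
        "sex": SEX_CODES.get(raw["sex"], "unknown"),
        "diagnosis": raw["diagnosis"],
    }
```

With these made-up layouts, `parse_record("0451A1234", 2011)` and
`parse_record("2030B5678", 2012)` both come out as dictionaries with the same
keys, so everything downstream can ignore which year a record came from.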

~~~
lifeisstillgood
Can you answer why it's so bad?

I can guess at a lot of answers, but which particular strain of corporate
pathology does the CDC suffer from?

~~~
rpedela
I don't really know. I have an educated guess for the custom format though.
They started collecting data before XML, JSON, etc. existed or were widely
used, which explains the custom text format. Since their systems already
handle their custom format, there is no real incentive to change it.

------
ishbits
I'm not sure if I'm correct or not, but I see 2 types of big data guys. First
there are the guys who know the tools - perhaps this is the "data engineer"
referenced by rdouble. This guy would know all the tools and how to map data
in and out of, say, something like Cassandra or HBase.

Then there are the guys who can properly analyze the data and turn that into
something that can save an enterprise money. This is probably where the big
bucks come in.

At my weekly codeandcoffee gathering we are discussing some problems we think
can be solved by big data analysis. The actual underlying technology is merely
an implementation detail; the value will come from the results. Our goal is to
save a target enterprise 1% a year, and given our target customer is a $40
billion company, that is a lot of money we can charge. We're just discussing
what-ifs at this point.

------
rdouble
With your current skills you'd likely get stuck in a "data engineer" role,
which honestly kind of sucks. IMO you are better off buckling down to learn
statistics and then some front end analysis tools like R and Tableau.
Unfortunately there is not really a shortcut for learning stats.

~~~
noobplusplus
I am not looking for shortcuts, but I do not want to be a researcher either.
My goals are simple: learn what will bring in revenue.

Let me know what to learn and how much, and I shall put in the effort.

~~~
rdouble
In my experience, even though data science is touted as a hot field, the
reality is that it's kind of boring and doesn't pay more than other types of
programming. It's hard to give specific advice because I think you're
operating on incorrect assumptions.

~~~
noobplusplus
What, according to you, are the incorrect assumptions I am operating on?
My goal is to make somewhere around 60-70 USD/hr, and I am willing to invest
2-3 months in learning.

And I already have 2 years of professional programming experience in Python.

~~~
rdouble
The incorrect assumption is that you'll necessarily make more money as a data
science consultant than as a Django/Flask developer. I don't think you will.

I don't know where you live, but where I live you can make 60-70 USD/hr with
the skills you already have. For example, building a JSON API server for a
mobile app with Flask pays anywhere from $50-150/hr in my experience. As
someone else mentioned, what you get paid has more to do with where you live
and who you work for.

------
pukka_my
Honestly, you're unlikely to get your hands on any data sets (or paychecks)
that could really be considered 'big' without (1) an understanding of how to
translate research questions into code & (2) the ability to tell your client
why the results matter in any way. Why not just do freelance dev work?

------
mblakele
[http://blog.zipfianacademy.com/post/46864003608/a-practical-...](http://blog.zipfianacademy.com/post/46864003608/a-practical-intro-to-data-science)
might help with the skillset. When you feel ready, update your LinkedIn
profile to match.

------
morganwilde
I don't know if that's the right or wrong way of seeking out what to do -
searching for "low hanging fruit" to pluck. By definition, they are easy to
reach, thus by the time you become aware of such opportunities, usually,
you're too late.

------
ig1
Where are you based?

Generally where you're located has a much bigger impact on your earnings and
potential clients than your skill-set.

~~~
noobplusplus
The land of "outsourced work": India.

