

Ask HN: Where to begin in data analysis? - mg_

Big Data has buzzed around for quite some time now, and every startup and enterprise alike is looking for Data Scientists to come work for them.

I have never really looked beyond the buzz for techniques for my own work, but (I'll admit it) stuff like what FiveThirtyEight is doing has piqued my interest, and I've realized I need to learn some tricks.

Where do I begin? I've taken some statistics classes, but as a CS student I have always prioritized different subjects, so I never felt that I learned enough.

To take a more practical example, because I'm still trying to wrap my head around the actual benefits I can extract from large amounts of data:

Let's say I sell tickets to events, and I have a log of every ticket sold: where it was sold (physical location or website), all info about the event, whether it was refunded, and so on. Pretty much all the info you'd need about a ticket. In what concrete ways could I extract value from this data?

I'd wager I'm not the only one here on HN who would really like some pointers on how to get started in this subject!
======
tgflynn
With enough data you could build a model to predict, with some level of
accuracy, the total number of tickets that will end up being sold for a given
event as well as the likely time and geographical distribution of those sales.

In order to use such a model to increase sales you would need to be able to
control some of the variables.

For example, the model might tell you that there's a relation between the
number of sales outlets in particular geographic locations and sales, and
that some locations appear to be underserved. Adding sales outlets in those
areas might lead to increased sales.
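
As a rough sketch of a first check along those lines (in Python, with made-up
rows), you could compare tickets sold per outlet across regions. The `region`
and `outlet_id` fields are assumptions about what the log contains, and a high
figure is only a hint: it could mean outlets are stretched thin, or just that
a popular venue is nearby.

```python
from collections import defaultdict

def tickets_per_outlet(sales):
    """Return {region: tickets sold per distinct outlet} from (region, outlet_id) rows."""
    tickets = defaultdict(int)
    outlets = defaultdict(set)
    for region, outlet_id in sales:
        tickets[region] += 1
        outlets[region].add(outlet_id)
    return {region: tickets[region] / len(outlets[region]) for region in tickets}

# Invented rows, purely for illustration: one row per ticket sold.
sample_sales = [
    ("north", "n1"), ("north", "n1"), ("north", "n1"), ("north", "n1"),
    ("south", "s1"), ("south", "s2"), ("south", "s1"), ("south", "s2"),
]
load = tickets_per_outlet(sample_sales)
```

Here `load` maps each region to its average tickets per outlet, so regions
can be ranked and the outliers looked at more closely.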

For online sales having information about the expected time profile of sales
could be used to more efficiently provision servers.
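
A minimal sketch of such a time profile, assuming each online sale carries an
ISO-format timestamp (the sample values below are invented): count sales per
hour of day and normalise, which gives a crude picture of when load peaks.

```python
from collections import Counter
from datetime import datetime

def hourly_profile(timestamps):
    """Return a 24-element list: fraction of sales falling in each hour of the day."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]

# Invented timestamps standing in for the sale times in the ticket log.
sample_timestamps = [
    "2013-03-01T09:15:00",
    "2013-03-01T09:45:00",
    "2013-03-01T12:05:00",
    "2013-03-02T09:30:00",
]
profile = hourly_profile(sample_timestamps)
peak_hour = max(range(24), key=lambda h: profile[h])
```

A real profile would probably also need to separate weekdays from weekends
and the first hours after an event goes on sale, which this sketch ignores.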

Is this just a hypothetical example, or is this something that's actually
related to work you're currently doing?

~~~
mg_
It is not work I'm currently doing, but it is a data set I have access to
which could be used for research into the topic :)

I guess sales numbers for a particular event are too closely tied to the
popularity of the performer, which I presume would need a totally different
data set to predict.

Sales outlets and their geographical locations are interesting, however.
Thank you for your suggestions!

I still feel a bit in the dark when it comes to starting off, though. Do you
have any suggestions on literature or tutorials to get things rolling? Any
(programming) language works fine :)

~~~
tgflynn
I haven't done much with geographic data myself, but I know Google has an API
for plotting that kind of data:

<http://gislounge.com/how-to-import-data-make-maps-google-fusion-tables/>

Beyond that, R is a very powerful tool for data analysis. I would recommend
installing RStudio, which provides a significantly nicer interface and is
available for all major platforms.

In R it's easy to load a CSV data file, look at the data and subset it. It
also has good tools for plotting. For exploratory analysis, scatter plot
matrices are often a good place to start. It's also easy to do linear
regression with the lm function.
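
To give a feel for what those steps amount to, here is the same workflow
sketched in plain Python rather than R (since any language works for you):
parse CSV rows, subset them, and fit a one-variable least-squares line, which
is the simplest case of what lm fits. The column names and values are
invented, and the inline string stands in for a real file.

```python
import csv
import io

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

# Inline stand-in for a CSV file; hypothetical columns, invented values.
raw = """channel,days_on_sale,tickets_sold
web,1,12
web,2,14
outlet,2,5
web,3,16
"""
rows = list(csv.DictReader(io.StringIO(raw)))

# Subset: keep only the rows for online sales.
web = [r for r in rows if r["channel"] == "web"]

xs = [float(r["days_on_sale"]) for r in web]
ys = [float(r["tickets_sold"]) for r in web]
intercept, slope = fit_line(xs, ys)
```

R does all of this in a few lines and also reports standard errors and fit
diagnostics, which this sketch leaves out entirely.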

Beyond that, almost every statistical and machine learning model you're
likely to have heard of has an R implementation, but the documentation is
often not easy to read and you may find yourself needing to read research
papers just to understand how to use some of the models.

In general, R is extremely powerful but it has a steep learning curve. There
are quite a number of websites now that have good tutorials for the basics,
though.

I don't know how large your datasets are. R may have problems dealing with
extremely large datasets, though that partly depends on how much memory you
have. If you have millions of rows or more, you may want to randomly
subsample your data so you can still use R to do some preliminary exploration
on it. You may need other tools for building your final models, though.
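
One way to take such a subsample without loading the whole file into memory
is reservoir sampling. This is a minimal Python sketch of the idea, not
something R requires; the function name and the seed parameter are my own
choices for illustration.

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return k rows chosen uniformly at random from an iterable of unknown length."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Keep the new row with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# range() stands in for a stream of rows, e.g. a csv.reader over a large file.
subset = reservoir_sample(range(1_000_000), k=1000, seed=42)
```

The sample can then be written out as a smaller CSV and explored in R as
usual. If the input has fewer than k rows, all of them are returned.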

Another site that may be helpful is <http://kaggle.com>. I believe they have
published a number of writeups on modelling methods by winners of some of the
data mining competitions they host.

