Ask HN: Where to begin in data analysis?
9 points by mg_ 1687 days ago | 3 comments
Big Data has buzzed around for quite some time now, and every startup and enterprise alike is looking for Data Scientists to come work for them.

I have never really looked beyond the buzz for techniques for my own work, but (I'll admit it) stuff like what FiveThirtyEight is doing has piqued my interest, and I've realized I need to learn some tricks.

Where do I begin? I've taken some statistics classes, but as a CS student I have always prioritized different subjects, so I never felt that I learned enough.

To take a more practical example, because I'm still trying to wrap my head around the actual benefits I can extract from large amounts of data:

Let's say I sell tickets to events, and I have a log of every ticket sold: where it was sold (physical location or website), all info about the event, whether it was refunded, and so on. Pretty much all the info you'd need about a ticket. In what concrete ways could I extract value from this data?

I'd wager I'm not the only one here on HN that would really like some pointers on how to get started in this subject!

With enough data you could build a model to predict, with some level of accuracy, the total number of tickets that will end up being sold for a given event as well as the likely time and geographical distribution of those sales.

In order to use such a model to increase sales you would need to be able to control some of the variables.

For example, the model might tell you that sales are related to the number of sales outlets in particular geographic locations, and that some locations appear to be underserved. Adding sales outlets in those areas might lead to increased sales.
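A very rough first pass at the outlet question doesn't even need a model. As a sketch, assuming the ticket log can be reduced to one (region, outlet) pair per ticket sold (the field names and the data below are made up for illustration), you could count tickets per outlet in each region and look at the regions where few outlets carry a lot of sales. Python is used here only because it is easy to read:

```python
from collections import defaultdict

def tickets_per_outlet(sales):
    """sales: iterable of (region, outlet_id) pairs, one per ticket sold.

    Returns {region: (outlet_count, ticket_count, tickets_per_outlet)}.
    """
    tickets = defaultdict(int)
    outlets = defaultdict(set)
    for region, outlet_id in sales:
        tickets[region] += 1
        outlets[region].add(outlet_id)
    return {
        region: (len(outlets[region]), tickets[region],
                 tickets[region] / len(outlets[region]))
        for region in tickets
    }

# Made-up example log: one (region, outlet_id) pair per ticket.
example_sales = [
    ("north", "n1"), ("north", "n1"), ("north", "n1"), ("north", "n1"),
    ("south", "s1"), ("south", "s2"), ("south", "s1"), ("south", "s2"),
]
summary = tickets_per_outlet(example_sales)

# Regions with the most tickets per outlet come first.
for region, (n_outlets, n_tickets, ratio) in sorted(
        summary.items(), key=lambda kv: -kv[1][2]):
    print(region, n_outlets, n_tickets, ratio)
```

A high tickets-per-outlet figure is only a hint, not proof that a region is underserved; it could just as well mean one very well-placed outlet.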

For online sales, having information about the expected time profile of sales could help you provision servers more efficiently.
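The simplest version of a time profile is just the share of sales that falls in each hour of the day. A minimal sketch, assuming each online sale in the log carries an ISO-format timestamp (the timestamps below are invented):

```python
from collections import Counter
from datetime import datetime

def hourly_profile(timestamps):
    """Return a 24-element list with the share of sales in each hour of the day."""
    counts = Counter(datetime.fromisoformat(ts).hour for ts in timestamps)
    total = sum(counts.values())
    return [counts.get(h, 0) / total if total else 0.0 for h in range(24)]

# Made-up timestamps standing in for the real sales log.
example_timestamps = [
    "2013-05-01T09:15:00", "2013-05-01T09:45:00",
    "2013-05-01T12:05:00", "2013-05-02T09:30:00",
]
profile = hourly_profile(example_timestamps)
peak_hour = max(range(24), key=lambda h: profile[h])
print(peak_hour, profile[peak_hour])
```

With real data you would probably want separate profiles per weekday and per days-until-event, since those are likely to look quite different.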

Is this just a hypothetical example, or is this something that's actually related to work you're currently doing?

It is not work I'm currently doing, but it is a data set I have access to which could be used for research into the topic :)

I guess sales numbers for a particular event are too closely tied to the popularity of the performer, which I presume would need a totally different data set to start predicting.

Sales outlets and their geographical locations are interesting however, thank you for your suggestions!

I still feel a bit in the dark when it comes to starting off, though. Do you have any suggestions on literature or tutorials to get things rolling? Any (programming) language works fine :)

I haven't done much with geographic data myself but I know Google has an API for plotting that kind of data :


Beyond that, R is a very powerful tool for data analysis. I would recommend installing RStudio, which provides a significantly nicer interface and is available for all major platforms.

In R it's easy to load a CSV data file, look at the data and subset it. It also has good tools for plotting. For exploratory analysis, scatter plot matrices are often a good place to start. It's also easy to do linear regression with the lm function.
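In R those steps are read.csv, subset, pairs (for the scatter plot matrix) and lm. To show the shape of the load, subset and fit workflow without assuming R is installed, here is the same idea as a sketch in plain Python with only the standard library. This is a stand-in, not R code; the column names and numbers are made up, and the hand-written least-squares fit handles only a single predictor:

```python
import csv
import io

# Made-up data standing in for a real CSV file on disk.
CSV_TEXT = """event,days_on_sale,tickets_sold,cancelled
a,10,120,no
b,20,220,no
c,30,320,no
d,40,50,yes
"""

def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))

def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

rows = load_rows(CSV_TEXT)

# Subset: keep only events that were not cancelled.
kept = [r for r in rows if r["cancelled"] == "no"]

xs = [float(r["days_on_sale"]) for r in kept]
ys = [float(r["tickets_sold"]) for r in kept]
intercept, slope = fit_line(xs, ys)
print(intercept, slope)
```

For anything beyond a toy like this you would want R itself (or a comparable statistics package), since it also gives you standard errors, residual plots and multiple predictors.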

Beyond that, almost every statistical and machine learning model you're likely to have heard of has an R implementation, but the documentation is often not easy to read, and you may find yourself needing to read research papers just to understand how to use some of the models.

In general R is extremely powerful but it has a steep learning curve. There are quite a number of websites now that have good tutorials for the basics though.

I don't know how large your datasets are. R may have problems dealing with extremely large datasets, though that partly depends on how much memory you have. If you have millions of rows or more, you may want to randomly subsample your data so you can still use R for some preliminary exploration, but you may need other tools for building your final models.
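One way to do the subsampling when the full log doesn't fit in memory is reservoir sampling, which draws a uniform random sample in a single pass over the data. That particular technique is my suggestion rather than anything specific to R, and the sketch below uses a range of integers as a stand-in for the real rows:

```python
import random

def reservoir_sample(rows, k, seed=None):
    """Return up to k rows chosen uniformly at random from an iterable
    of unknown length, reading it only once."""
    rng = random.Random(seed)
    sample = []
    for i, row in enumerate(rows):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return sample

# Stand-in for a million-row log; in practice this would be a file iterator.
subset = reservoir_sample(range(1_000_000), k=1000, seed=42)
print(len(subset))
```

The sampled rows could then be written out to a smaller CSV file and loaded into R as usual.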

Another site that may be helpful is http://kaggle.com. I believe they have published a number of writeups on modelling methods by winners of some of the data mining competitions they host.
