mg_'s comments

It is not work I'm currently doing, but it is a data set I have access to which could be used for research into the topic :)

I guess sales numbers for a particular event are too closely tied to the popularity of the performer, which I presume would need a totally different data set to start predicting.

Sales outlets and their geographical locations are interesting, however. Thank you for your suggestions!

I still feel a bit in the dark when it comes to starting off, though. Do you have any suggestions on literature or tutorials to get things rolling? Any (programming) language works fine :)


I haven't done much with geographic data myself, but I know Google has an API for plotting that kind of data.


Beyond that, R is a very powerful tool for data analysis. I would recommend installing RStudio, which provides a significantly nicer interface and is available for all major platforms.

In R it's easy to load a CSV data file, look at the data, and subset it. It also has good tools for plotting. For exploratory analysis, scatter plot matrices are often a good place to start. It's also easy to do linear regression with the lm command.
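
Since you said any language works, here is a minimal sketch of that load / subset / regress workflow in plain Python (standard library only) rather than R. The column names and numbers are made up purely for illustration, and fit_line is a hand-rolled stand-in for what lm would give you for a single predictor, not R's actual implementation.

```python
import csv
import io

# Toy data, invented purely for illustration (not from any real data set).
CSV_TEXT = """outlet,price,tickets_sold
north,10,200
north,20,160
south,10,180
south,30,100
north,30,120
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, converting numeric columns to float."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({
            "outlet": row["outlet"],
            "price": float(row["price"]),
            "tickets_sold": float(row["tickets_sold"]),
        })
    return rows


def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


rows = load_rows(CSV_TEXT)

# Subset: keep only the rows for one (hypothetical) outlet.
north = [r for r in rows if r["outlet"] == "north"]

# Regress tickets sold on price within that subset.
intercept, slope = fit_line(
    [r["price"] for r in north],
    [r["tickets_sold"] for r in north],
)
print(f"rows in subset: {len(north)}")
print(f"tickets_sold ~ {intercept:.1f} + {slope:.1f} * price")
```

With real data you would read from a file instead of a string, and you would want to look at plots before trusting any fitted line.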

Beyond that, almost every statistical and machine learning model you're likely to have heard of has an R implementation. The documentation is often not easy to read, though, and you may find yourself needing to read research papers just to understand how to use some of the models.

In general, R is extremely powerful but it has a steep learning curve. There are quite a number of websites now that have good tutorials for the basics, though.

I don't know how large your datasets are. R may have problems dealing with extremely large datasets, though that partly depends on how much memory you have. If you have millions of rows or more, you may want to randomly subsample your data so you can still use R for some preliminary exploration, but you may need other tools for building your final models.
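
One way to do that subsampling without loading the whole file into memory is reservoir sampling, which reads the rows once and keeps a uniform random sample of fixed size. The sketch below is just one possible approach in plain Python; the function name sample_rows, the sample size, and the generated stand-in data are all assumptions for illustration.

```python
import csv
import io
import random


def sample_rows(lines, k, seed=0):
    """Return the header and k rows chosen uniformly at random from a CSV stream.

    Uses reservoir sampling (Algorithm R), so only k rows are held in memory
    no matter how long the input is.
    """
    rng = random.Random(seed)
    reader = csv.reader(lines)
    header = next(reader)
    sample = []
    for i, row in enumerate(reader):
        if i < k:
            # Fill the reservoir with the first k rows.
            sample.append(row)
        else:
            # Replace an existing entry with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                sample[j] = row
    return header, sample


# Stand-in for a large file: 10,000 generated rows (not real data).
big_csv = io.StringIO(
    "id,value\n" + "".join(f"{i},{i * 2}\n" for i in range(10000))
)

header, sample = sample_rows(big_csv, k=100)
print(header, len(sample))
```

For a real file you would pass an open file object instead of the StringIO, write the sampled rows back out as a smaller CSV, and explore that in R.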

Another site that may be helpful is http://kaggle.com. I believe they have published a number of writeups on modelling methods by winners of some of the data mining competitions they host.

