

Ask HN: Help me choose a data science research project - csdrane

I'm mulling over the idea of working my way through a data science text and self-teaching. I'd probably create a blog to document my efforts and to serve as notes to myself. I think I'd learn more and have more fun if I had a research project that I could work through as I learn.

I'm very much interested in finance and economics. Additionally, professionally I work in commercial real estate. However, I don't know how well these subjects would lend themselves to research projects. Generally trying to predict the markets is a fool's game. So I'm wondering what unexplored, worthwhile areas of research might exist. I'm reaching out to the HN community to see if you guys have any interesting ideas. Thanks!
======
j2h6mW
The hardest part of my practical coursework in statistics was picking good,
free data sets for final projects. Pick something awesome, your final
presentation will be awesome; pick something lame, and your final project gets
you an A in Dejected Foot-Shuffling 101. If anything, the best data sets
_weren't_ from the sexy, unexplored fields. Remember how everyone tells you
to pick classes by the professor, not the subject? It's a similar
counterintuitive thing for data sets. Find a data set that's rich and complete, and
even if it's not a topic you're interested in now, you'll secretly love it by
the end of the term.

Enough sermonizing. Here's a list of data set ideas that served me well in my
youth:

1\. R comes with a lot of built-in data sets. Open up R and run the command
"data()" to see the list. Many R packages come with additional data sets (I
like the diamonds one from ggplot2). All these built-in data sets are sort of
small and not really project-worthy, but they're nice if you're just playing
around with new techniques.

2\. Government agencies release large, interesting data sets. Weather, census
reports, travel statistics, public health data... The only problem is that
they're usually a pain to query. Think outside your own country. And get ready
for spatial stuff.

3\. Academic institutes release pretty neat data, too. Natural science stuff,
geology stuff... Again, here comes spatial data analysis.

4\. Data journalists sometimes publish their data along with the story, and
usually, they haven't found _nearly_ all the cool stuff in there yet. This,
for instance, looks insanely fun:
[http://project.wnyc.org/dogs-of-nyc/](http://project.wnyc.org/dogs-of-nyc/)

5\. Sports data is free like tap water, terrifyingly detailed, and deeply cool
indeed.

6\. Natural language processing. Check out Project Gutenberg! I like these
analysis projects...
[http://lotrproject.com/statistics/books/](http://lotrproject.com/statistics/books/),
[http://bost.ocks.org/mike/miserables/](http://bost.ocks.org/mike/miserables/)

7\. Make your own data! Do you have a pedometer? Records of what temperature
your house is? Some bloggers in the "Quantified Self" movement seem awfully
cavalier about their own privacy, but they have undeniably boffo data.

8\. And finally: commercial real estate?! There has got to be _so_ much
interesting data to work with there. I know you don't think you can predict
the markets, but at the very least you could make pretty maps and pictures.
Maybe your company will let you play with some data, provided you show them
your insights? Don't know if they'd let you blog it all over town, though...

Congrats, my friend, you are one of us now. The people who drool over CSV
files.

------
tmoullet
I know that there are a few municipalities in the U.S. that have made some of
their governmental records available via API. Off the top of my head, I'm not
sure which ones, but there is likely a boatload of under-analyzed data there.
Similarly, the Census Bureau has a lot of large data sets.

Which book are you going to be studying?

~~~
csdrane
I haven't decided on one yet although I did come across this HN thread
[https://news.ycombinator.com/item?id=4973450](https://news.ycombinator.com/item?id=4973450).

Do you have any recommendations?

~~~
tmoullet
No, I don't. Sorry. Thanks for the link. I'm actually looking to start
learning some advanced stats also, so I'm on the lookout for resources.

------
ig1
Have you looked at Kaggle? It's a good place to learn as you can benchmark
yourself against others and after a contest has closed there's normally a fair
amount of post-game analysis as people share approaches.

------
agibsonccc
Browse through here:
[http://archive.ics.uci.edu/ml/](http://archive.ics.uci.edu/ml/)

You might be able to pick up a few things you want to predict based on the
datasets here.

------
stocktradr
Are you looking for a programming challenge or something? I've got an idea off
the top of my head that I'd love to share, but I don't know what kind of
experience you have or what you're looking for.

~~~
csdrane
A few things: to learn about ML methods, to improve my coding chops a bit, and
to add to my portfolio if I'm ultimately happy with how it turns out.

~~~
stocktradr
I've had a theory for a while. Can you use technical stock indicators (such as
Bollinger Bands or simple moving averages) on natural data sets such as crime?

The theory would say yes, since the formulas used aren't specific to the
market; they just operate on data. It's puzzled me for quite a while and I
think it could be valuable if true.

This would test your coding, API, mathematics, and economics skills for sure!
Let me know if you're more interested in the idea.
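
To make the idea concrete, here's a rough sketch of how you might compute the bands on any numeric series and flag the points that break out of them. Everything in it is an assumption on my part: the function names, the 5-period window, the 2-standard-deviation width, and the weekly counts (which are made up, not real crime data). It uses the population standard deviation over the trailing window, which is one common way to define the bands, not the only one.

```python
from statistics import mean, pstdev


def bollinger_bands(series, window=20, num_std=2.0):
    """Return one (sma, upper, lower) tuple per full trailing window."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    bands = []
    for end in range(window, len(series) + 1):
        chunk = series[end - window:end]
        sma = mean(chunk)                   # simple moving average
        spread = num_std * pstdev(chunk)    # band half-width
        bands.append((sma, sma + spread, sma - spread))
    return bands


def breakouts(series, window=20, num_std=2.0):
    """Indices whose value lies outside the band built from the
    `window` values immediately before it."""
    bands = bollinger_bands(series, window, num_std)
    flagged = []
    for i in range(window, len(series)):
        # bands[i - window] covers series[i - window:i], i.e. the
        # window that ends just before index i.
        _, upper, lower = bands[i - window]
        if series[i] > upper or series[i] < lower:
            flagged.append(i)
    return flagged


# Hypothetical weekly incident counts, invented for illustration.
weekly_counts = [10, 12, 11, 13, 12, 11, 30, 12]
print(breakouts(weekly_counts, window=5))  # prints [6]
```

Whether a flagged week means anything is the actual research question; the code only shows that the formulas don't care where the numbers come from.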

