Ask HN: Best “Big Data” Course?
28 points by ubertoop on Dec 22, 2020 | 7 comments
I have access to a huge data set (english words, numbers) and would love to build a business around it.

I'm definitely putting the cart before the horse though. I need to understand what's possible via "data science" but I know nothing.

Where should I start? I'm a self taught software engineer by trade (full stack, mobile, and firmware).



I strongly recommend Georgia Tech's ISYE6501X course, "Introduction to Analytics Modeling" - available at edX.org[0]. The professor, Joel Sokol, is a masterful lecturer with excellent pacing and a great intuition for guessing what the audience will be thinking at any given point in a lecture, and for quickly addressing those thoughts.

The curriculum covers the foundational knowledge required for machine learning, data analytics, and big data. The course does a fantastic job of breaking down how to select the most appropriate method for any given situation you might run into, and how to critically compare the results of various methods to see which worked best.

Personally, I feel this material should be included in every STEM major's undergraduate coursework. I'm usually displeased with the quality of MOOCs, but this particular course truly stands out.

The actual topics discussed include:

Support vector machines (SVM), classification, clustering, principal component analysis (PCA), Bayesian modeling, time series models (exponential smoothing, ARIMA, GARCH), decision trees, Markov chains, k-means, k-NN, Q-Q plots, probability distributions, and graph analysis. Software used included R, Python, and Rockwell’s Arena Simulation Software.

0: https://www.edx.org/course/introduction-to-analytics-modelin...


Thanks. Signed up, looking forward to Jan 21!


>I'm definitely putting the cart before the horse though

Yes, but you're a software engineer by trade, so you understand this is akin to someone saying: "I have access to a programming language and would love to build a business around it". You're ahead in the sense that you're aware of it.

Usually, we sit down with clients and work with them to extract the problems they may have, and we keep them talking about problems because they want to escape to shiny/exciting/buzzwordy things ("chatbots, dashboards, NLP"). We drill down until we extract problems, not solutions. We then go through our usual processes.

One way to look at it with your background would be the following: what would be close to impossible to write a program for, but that you do easily as a human? Make a list. How much are these tasks costing you (cognitive load, time, money)? How frequently do they happen? This should get you started.

You do something similar but opposite with programming: you think of something that you can describe in clear steps that would be easy for a computer to do tirelessly and fast.

One other way to look at it is to open your eyes to the problems you have at work. If you work at an organization, apply the same thing. How much are these costing you? I'm asking you to think about selling to enterprise because I'm biased given that this is what I do, and because you want to build a business. Catering to enterprise would be, in my opinion, more interesting from a business standpoint.

What is the source of that data? Is it about a specific domain/industry/sector/function? Who would find it valuable?


I'm slightly disappointed that the term "data mining" fell out of favour, because I think mining helps convey the idea that you have to sift through a lot of worthless crap before maybe, if you're lucky, getting something valuable out of your resource. This is not at all intended as snark - it's just that (as you already know) the huge data set might not have enough value in it to start a business.

Regardless of commercial value, it could be a very nice resource to experiment with NLP methods - simple TF-IDF for classification models, topic modelling for unsupervised learning, training/fine-tuning your own BERT model, etc.
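
To make the TF-IDF suggestion concrete, here is a minimal sketch in plain Python. It assumes the data set can be split into short text documents; the `tf_idf` helper and the sample sentences are made up for illustration, and in practice a library implementation would be the more sensible choice. It uses the basic tf * log(N / df) weighting, which is only one of several common variants.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document, using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights


# Hypothetical sample documents, not taken from the data set in question.
docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell on the news",
]
scores = tf_idf(docs)
```

Terms that occur in every document (here "the") get a weight of zero, while terms unique to one document score highest. Those per-document weight vectors are what you would feed into a classifier.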

To actually answer the question, the best "Big Data" (very much distinct from "Data Science") course I ever did is a now totally outdated course by Cloudera that went through Hadoop in great detail - mappers, reducers, shuffle-and-sort, the works. It really helped me understand what was going on under the hood when I ended up using Hive and then Spark a few years later. It might have merged into "Developer Training for Spark and Hadoop"[0], though I'm not completely sure about that.
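
For anyone who has not seen the mapper / shuffle-and-sort / reducer flow, a toy word count in plain Python gives the rough shape of it. This is a single-process sketch under my own naming (`mapper`, `shuffle_and_sort`, `reducer`, `word_count` are illustrative functions, not Hadoop's API), and it leaves out everything that makes the real thing hard: partitioning, distribution across machines, and fault tolerance.

```python
from collections import defaultdict


def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.lower().split():
        yield word, 1


def shuffle_and_sort(pairs):
    """Group values by key and return the groups in sorted key order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())


def reducer(key, values):
    """Reduce phase: collapse all values for one key into a single count."""
    return key, sum(values)


def word_count(lines):
    mapped = (pair for line in lines for pair in mapper(line))
    return [reducer(key, values) for key, values in shuffle_and_sort(mapped)]


result = word_count(["big data big plans", "data beats plans"])
print(result)
# [('beats', 1), ('big', 2), ('data', 2), ('plans', 2)]
```

In a real cluster the mappers and reducers run on different machines and the shuffle moves data over the network, which is why understanding that step pays off later with Hive and Spark.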

[0] https://www.cloudera.com/about/training/courses/developer-tr...



What business do you want to build? What makes you think that the dataset that you have is useful? Is the dataset in tabular form or in free text?


Chris Mattman has a good series online



