Hacker News
Berkeley offers its data science course online for free (berkeley.edu)
801 points by seycombi on Apr 5, 2018 | 158 comments

I find it curious that there are so many courses for data-science related subjects, which superficially seem to cover the same material, and relatively few courses covering more traditional CS topics such as computer systems, networks, OS. I suppose it has to do with the market, but also feels like colleges are skating to where the puck is, rather than where it will be (or perhaps, where it could be).

This is a general issue I have seen as well.

We live in 2018 and there is no open source course for everything. Instead there are probably 10-30k universities who have similar courses and professors who give the same lecture every year.

They get paid often enough by countries to create and do those courses. In Germany most of our universities are paid for by all of us Germans anyway.

And what do you find online? Always the starter version like 101 computer science or videos with bad audio or video, no proper exercises, no solution helper etc. Nothing. You have to go to different sites to sometimes pay or sometimes not.

There are no local places to meet up with people.

There should be a global initiative for global free and open access learning. Sponsored and supported by companies and countries. Build upon a core of a knowledge graph based on topics or 'snippets of knowledge'. Like for example: math -> add -> sub

Something like 'The Map of Mathematics' (https://www.youtube.com/watch?v=OmJ-4B-mS-Y)

And when you wanna get the globally accepted math 101 level, you have to take specific topics / snippets.

And those snippets can then be filled by different people who are making a lecture for that topic and you can choose whom you like more or who is better at explaining it to you.

What do I do instead? I ask around for the lecture scripts because they are always behind a simple password-protected area, or I have multiple links to different pages of different universities who offer different courses for free as videos for their students, in sometimes/often bad quality and/or with bad video players etc.

It sucks and this is stupid.

>There should be a global initiative for global free and open access learning. Sponsored and supported by companies and countries. Build upon a core of a knowledge graph based on topics or 'snippets of knowledge'.

I like this idea and the framing of it a lot

Have you seen this? Does this meet your criteria/have sufficient resources/curriculum?

> http://datasciencemasters.org/

Nope, not at all.

I'm not looking for the next github page with collections of tons of different sites with different courses.

I still hope for one platform where all those smart people out there are working together to optimize learning.

I forget if it's the Verge or some other popular podcast but they are always suggesting Apple just put together a fully open and accredited online university.

The problem is that there are a lot of people getting paid a lot of money all over the world to work in post-secondary education who control the keys to accreditation and who have proven very resilient at resisting any optimization efforts.

Apple isn't open. Their stuff has tended to not work on Linux.

I don't see why one would need Apple to do it. A consortium would be much better and more likely.

Apple is big enough to piss away a billion dollars and create a crazy good online university.

Also need some good PR I guess :)

Here is a list of CS courses including many graduate level courses:


Check out khanacademy.

I have also found this interesting. What I don't understand is that the number of data science jobs is nowhere near the levels that people make it seem. I am not sure where all these people will end up working if they want to be data scientists. There is not a need to hire huge teams of data scientists like you might for dev roles; it doesn't scale the same way.

Every job that involves data in any way is being relabeled a data science job. Most of them are just generating dashboards and posters in Excel or Tableau for people who are data illiterate. I know many people with maths/stats/comp-sci backgrounds who end up in these sorts of jobs.

“Just add a bunch of green up arrows and red down arrows, your manager will love it” was advice from a co-worker of mine. Sadly, she was right.

It's actually become somewhat hilarious to me. Like you said, the data science label is being applied quite liberally (no judgment, I'm not the world's authority on how it should be applied), so here you have companies paying $100k or more to have people do Excel work or Tableau visualizations.

"Data scientist" is the new "business analyst".

I think data science moved into the hole where analysis used to be.

This. You are exactly right. I work for a startup not in any way related to data science or ML etc. We use Python here. My flatmates work for a big-name data science company, and most of the time it's numpy and a bit of data visualisation. Pandas and Numpy to the rescue. I am like dude, I can do that in a blink.

In an engineering class, when dealing with a problem about factories that produce widgets and consume resources/widgets from other machines, I made a complex Excel spreadsheet that was animated to pass little numbers around that represented the items produced or consumed.

It didn't actually give correct results, I'm not sure it could have worked (I needed to do two updates on the same cycle and never did figure out how), and I documented that it was buggy.

But it looked cool and I got good marks. So my experience pretty much agrees with yours.

Where can I get a job like that? I'd be happy to take it at this point.

Data engineering as well.

Haha, so much yes.

Every graduate that doesn't find a high paying job in the field is an excellent candidate for the next level of education.

After you have given a university hundreds of thousands of dollars and a decade-plus of your life, you will then be ready to teach the next crop of students.

> Every graduate that doesn't find a high paying job in the field is an excellent candidate for the next level of education.

1. You really think Berkeley's top-ranked PhD program is recruiting people who couldn't find jobs? No. Not only can 100% of successful top-tier PhD applicants find jobs, 100% of them are strong candidates for the top echelon of entry-level jobs.

If you disagree, go look up the people in Berkeley's CS PhD program. Point out a single person you think didn't turn down mid-100s job offers to attend Berkeley.

Getting into one of these top-5-to-10 PhD programs is no small thing...

2. MOOCs aren't PhD programs and in nearly all cases aren't designed to feed into PhD programs.

3. Finally, at least in CS, at least for the moment, educators are in extremely high demand. And again, at least for CS, that demand isn't being manufactured by the academy.

> You really think Berkeley's top-ranked PhD program is recruiting people who couldn't find jobs?

It is well known that PhD programmes churn out far more PhDs than can reasonably be employed in their field.

I mean, including friends and former colleagues I probably know maybe 300 people with PhDs. Of those I can count on my fingers those actually doing research in academia, probably one hand those on tenure track. But do you really think any of them slogged through the programme in Physics or Biology just to get a software job writing CRUD apps, or prettying up BI reports?

You're arguing against a point that the OP didn't make.

See point 3. At least in CS, at least for the moment, there's insane demand for CS educators. Maybe not research track at top 20 R1, but definitely requires a PhD.

Close to 100%, but not 100%.

This is sort of technically true. Modified statement, based on extensive personal experience: ~80+% had such an offer in hand, and the remaining 20% either:

a) didn't bother applying but would've been shoo-ins, or else

b) knew very early (freshman year) they were research-bound and optimized for a non-industry objective function (but could've skated into an industry job of their choice given a shift in undergraduate career focus). E.g., couldn't pass a coding interview and no industry internships but have one or more top-tier publications in a hot subfield.

But (b) is kind of stupid to think about. It's like saying a successful lawyer would not make captain in the military. This may or may not be the case, but either way, who cares?

Lawyers can commission directly as captains in the USAF (and probably other branches) as long as they are not too old and can pass the fitness test. I think they also have to have passed a bar exam in one state, but it doesn't matter which one. They can get up to $65k of student loans repaid and don't even have to go through the same basic training as everyone else.

So (b) is slightly less trivial than saying a successful lawyer would not make captain in the military because most lawyers are one conversation, a few signatures, and one oath away from being a captain.

So don't delay lawyers, join today!

Did you actually do any recruiting? Or are you like me and just sort of started inadvertently talking this way after a certain number of years?

No recruiting. I started off by wanting to point out that military lawyers don't "make" captain, they are given that rank to start. Then I realized it sounded a little like a recruitment talk so I decided to go all in.

Yes, I know, that was my point.

> ...So (b) is slightly less trivial

I'd argue not. In both cases, a fully capable and prepared person has to jump through a couple of relatively trivial hoops. And when those hoops aren't possible to jump through it's a weird case. We can split that hair, but it's silly.

You chose a funny example. A lawyer can make captain in the military just by signing on the dotted line, same with a doctor.

It’s a pyramid scheme you can get in on with financing.

I disagree with this. Every enterprise company has an analytical department, even more so in the public sector. My municipality has 8 guys working on analytics for instance.

They are mostly economics or (I’m not sure what it’s called in English, but it’s a degree in societal administration), but they really ought to be data scientists because everything they do is based on huge sql data sets.

We pay private contractors a lot of money to turn our data into cubes and manageable models because none of our analysts know how.

In 10 years I suspect anyone with that job title will need data science on their resume. Not just to manage the data, but also to start doing machine learning on it.

By comparison we have one network guy to run the network for 10.000 employees and 5000 students, with a backup guy who knows everything the first guy does but works with something else, you know, in case the first guy quits.

> My municipality has 8 guys working on analytics for instance.

Do you propose that they should take a MOOC like this one, or that they should be replaced by MOOC graduates?

I think they’ll be replaced with candidates that mix economics and data science as they naturally retire or go elsewhere. If I was young in the field I’d definitely take a degree in data science, especially if I worked in the public sector where we’re much more willing to pay for your education as well as give you time off to take it.

I actually think in order to do it right you need to have a sizable team (n > 7) of data scientists in order to keep them honest, productive, and developing their skills. They will need to pair up with a team of software devs/ML/data wranglers to help them push the edge on what they are doing. Depending on the company that could be the complete engineering team, or it could be a disjoint set.

Rebrand marketing into a data science'y name to attract quantitatively minded applicants (already starting to happen in some orgs)

As someone in data science this isn't a trend I particularly like. I suppose to some extent these jobs can be filtered out by requiring a salary that they're not willing to pay for that work.

I don’t think the colleges care where their students end up when they’re done with them.

They care to the extent that the alumni are willing and able to make fat donations.

Startups do not need a huge team of data scientists... but industry in general does need data analysts and this would certainly enhance the skills of that crowd.

MIT OpenCourseWare has a bunch of classes related to traditional CS topics. Also, just searching universities you can find some. Data Science is the new hip class and it’s just very aggressively marketed. The other one being marketed heavily is intro to coding.

Their computer security class is amazing. All lectures, notes, labs, quizzes etc: https://ocw.mit.edu/courses/electrical-engineering-and-compu...

I think all MOOCs have traditional CS courses but the marketing hype is on Data Science and ML. I remember doing Tim Roughgarden's (Stanford) Algorithms course on Coursera. Loved it.

Good to know about OCW. Are there often videos as well as assignments? My issue in the past was that they often didn't have many of the teaching resources.

Intro to coding is a fascinating one. From a marketing/business standpoint it makes sense, but 3-4 years ago it was extremely frustrating to see dozens of intro coding courses, but practically nothing for intermediate programmers. Thankfully we're past that point for the most part.

It depends on the class but they have videos and assignments there.

Because it sounds cool... I remember when I was applying for college (2001), nanotechnology and biomedical engineering were all the rage. Glad I stuck with electrical engineering.

I thought biomedical engineering still has a lot of potential to offer?

My friend who majored in biomed kept getting passed over during her job search. Turns out all the biomed companies just wanted to hire mechanical engineers. (She did eventually find a job in her field.)

I guess it's analogous to the data science degrees popping up today. Will be interesting to see if it ends up as a fad degree or a legitimate career path.

My impression (as someone who was once very interested in biomedical engineering) was that it was always mostly mechanical engineering plus some bio/chem and teaming up with doctors. I didn't end up going that route but I got an ME from a school where a number of mechanical engineering professors worked with the affiliated hospital on projects.

The people I knew that majored in biomedical engineering went on to med school... Probably one of those areas that is always in the news and science magazine but still too early to revolutionize life.

I remember when Dolly was cloned and we would have a whole new industry...

Evidently I am ignorant based on comments from people who are obviously better informed than me.

What is the field that is going to revolutionize curing diseases through genetic engineering etc?

Or bioinformatics ~2009?

I'm just curious -- where do you think the puck will be? I've had a number of younger acquaintances ask for career advice. Pursuing some kind of data science seems like an obviously smart direction now, but I've wondered if this, as well as traditional CS career paths, may be in danger of becoming over-saturated areas, now that everyone views them as sure paths to a job that pays well.

By its nature data science work is very undefined. There is not yet a widely established design path for data science as there is for software development. The risk for someone starting their career in data science is that they will end up in an organization that doesn't know how to data science. So I'd recommend steering young folk to larger teams that have PhD level Statisticians.

On the tech side front end or data science both seem like good options.

Companies need a storefront, so front end work will continue for a while, until it is super easy for any joe to make a professional website.

Data Science looks promising too, because it is automating and solving problems that previously could not be done.

However, outside of tech the world still needs skilled blue collar workers. For example, I don't see carpenters being automated away any time soon. I hear some of those jobs pay better than tech work too.

I think there is a natural barrier: these paths are too hard and too mathy for most people. People have known for decades that engineering is a good path.

Yep, a lot of the people who want to get into it run away when they find out you need to know how to code and understand the math. For some reason there is this idea that it is easy money.

I simply mean in terms of the availability of online courses. Eventually there will be harder CS courses online. While there's been enormous movement in this direction in even the last 2 years, there's still a lot of room for growth and improvement.

Math is the secret weapon here. Cybersecurity, data science, programming.

I think it’s because data science has a more immediate, broader applicability than computer science. Not everybody needs to know how to program a full application; but they should be able to load in a dataset and statistically analyze it. Looking at the type of people taking Data 8 compared to CS61A (the introductory programming course), I would say the former is a fairly diverse crowd (political science majors, economics majors, biology majors, etc.)

It is also a possibility that Data Science is an easier topic to learn than Computer Science, and thus more popular.

It might be more popular as many data science courses use Python and therefore students don't have to get their code to compile.

Seriously though, I think people are drawn to data science out of a desire to create stories with some underlying support (data/evidence) in order to influence policy or business decisions.

61A also uses Python.

It says fastest growing course. At my university it's also the fastest growing course. It grew from 0 to 50 students in the last semester.

You're talking only about the public + free MOOC stuff, right? I think it's reasonable for that to be biased toward less specialized stuff.

Internally, Berkeley definitely isn't biased toward the intro-level stuff. Quite the opposite. But the most polished, rehearsed, mass-manufactured classes are certainly the gigantic intro-level ones everyone takes.

> I find it curious that there are so many courses for data-science related subjects

Is it that there are more courses in data science relative to other topics, or is there just more marketing around these classes? It costs Berkeley essentially nothing to pump out some press around the release of course materials in data science.

GaTech has a lot of good MOOC Computer System courses.

Yeah they seem to be one of the few that's really leading the way here.

All these courses in data science ain't gonna solve science's biggest mystery, that is consciousness. Too bad we don't focus on that.

/Cognitive scientist

We are starting to see a bit of a pushback although it's hard to discern amongst all the hype. There's some recognition that deep learning is just one technique that happened to pop (to use Rodney Brooks' term) for a variety of reasons but that we haven't made huge progress in cognitive science and other fields.

Deep learning is the current shiny toy but I suspect we'll find it isn't actually sufficient for a lot of things we want to do and we'll run into a wall a lot of people aren't expecting.

That's why it's called data science not cognitive science. Duh!

Our computer security course at Berkeley regularly attracts 600+ students. Berkeley is doing fine in terms of traditional course offerings.

I meant in terms of online courses. Would love to see that class put online.

All the materials (except lecture videos) are available online; just google CS161 Berkeley.

People at Berkeley view this class as kind of a joke. The average grade is insanely high and the topics are covered in much less depth than just the normal intro cs or stats classes.


The instructors have explicitly said, if you have previous CS or Statistics knowledge, the class isn’t for you. This is for people who don’t know how to program and haven’t taken a statistics course yet.

Nonetheless, IMO parent's observation is important for purchasers of MOOCs. It's true for a lot of the very popular MOOCs.

I code myself. I've watched it. Pretty good learning platform (never had a look at a MOOC before): I am impressed (but I will take a look at other MOOCs mentioned in this discussion and I will certainly be less impressed very soon). Finished the first week. Right now, I find it a good introduction for someone who has no knowledge of code and statistics. IMHO, the main advantage would be that such a person may learn what coding and statistical reasoning look like. The main disadvantage would be that this person thinks he has learned enough and stops trying to learn more about those topics. What an enlightening free week for the people: first look at "Do you trust this computer?", then attend this MOOC "Data8.1x"?

Good to know. I looked at this Berkeley course (along with some private offerings like General Assembly) and got the feeling that they really weren't worth the investment for a guy with a Math degree, a CS minor, and programming experience going back to childhood.

But I think I'd like some kind of formal, credentialed program that would build on my existing linear algebra + software skills (and address the weaknesses in my statistical understanding that I know are there based on how I felt about my grasp on the related material for even the classes I passed)... and maybe isn't quite as big an investment as a full-fledged master's degree.

Anybody have any suggestions?

This is exactly what we built at Lambda School - our data science/machine learning program has math (linear algebra, calculus) and CS (python) as pre-requisites, and is designed to train the rest of the way. It can also be free until you get hired in field.

It is a big commitment - 6 months full-time or one year part-time.


The Georgia Tech online masters?

Depending on what stats you want to do, there are some pretty decent MOOCs. No one is going to claim that Daphne Koller's PGM course is weak in any way, for example[1].

[1] https://www.coursera.org/learn/probabilistic-graphical-model...

A graduate certificate? Not sure what you’d want it in, something like this Applied Statistics option? https://www.worldcampus.psu.edu/degrees-and-certificates/app...

I randomly started checking other CompSci courses on that site and they all had a similarly high average grade, in some cases even higher (A instead of a B+).

Which are the hard courses at Berkeley in CompSci using the site you linked?

The hardest CS undergraduate course offered is probably 189: Introduction to Machine Learning. Nonetheless that class has a B+ average; so I wouldn’t say difficulty of class correlates with a low grade average. There are others like 170 and 162 which you can check out.

If my memory holds, there’s a policy that class averages should be around 3.0-3.3 (B/B+).

Compilers (164) and digital system design (152) are the hardest classes. Other classes with high workload are systems (162) and graphics (184)

I'm a UC Berkeley alum. When I was there this was a course taken by humanities majors to learn some programming so that their resume looks cooler. The majority of STEM majors take CS 61A (SICP) or E7 (Programming in MATLAB). Just noting this as context: this is not the class intended for CS majors; this one is: https://cs61a.org/

Or if you want to get experience with data science: Stats 134 (easy), CS188 (medium), CS189 (hard).

Meh, I disagree. CS 188 is not a very useful course. You can learn 188 material very quickly (i.e. reading Wikipedia for a couple hours) if you do well in CS170 and CS 189.

I think for data science: Stats 134, CS 189, EE 127, EE 126 are most useful (in this order). Of course, in order to do well in CS 189 you need to have a good background in probability which can either be CS 70/Math 55 (if very well understood), or Stats 134, or EE 126.

I personally thought Stats 134 was the hardest of those courses by far (though I took it under a visiting professor who was needlessly difficult). 188 was a breeze, and I believe the full course is offered for free on edx

CS 189 (with Sahai) (Machine Learning) was by far the hardest course I took in my life.

I think this is a bad trend. These universities make basic courses free to gain popularity and then ask for big money for their real courses.

This is bad in two ways:

1) The people taking these courses do not learn much for the effort and time they spend. Also, it gives them the illusion that they know enough because they took a course from a big university.

2) Industry is already so confused in hiring that they hire by name. So even if you take these courses and study in depth on your own, you can't get hired. Someone more qualified cannot get hired just because they can't pay 100k to get a degree in machine learning from one of these big universities.

This is really a bad trend and we should spend time on real courses. Everyone knows that TV series are a waste of time; these courses are like TV series. Stop watching them.

A course called "[Intro to] data science" should be taken about as seriously in hiring as "Intro to computer science", or "Intro to mechanical engineering". There's no reason these courses should bear any weight in hiring, and it's disingenuous to attempt to lead people to believe otherwise.

I was talking about people who study in depth on their own after the course.

What a bleak world you must live in if all TV counts as a waste of time. Does that extend to other forms of drama? Literature? Entertainment?

Even if you believe it's pointless it's pretty clear it's not something everyone else "knows".

Always boggles my mind with these "free" online courses that they still stick to the old method of "registering" for the class and then following a regimented schedule.

Seriously, just upload the lecture videos, put the homework online and textbook. Add a message board and you're golden.

Having ~7 moocs and 1 udacity nanodegree under the belt, here is my anecdata :

Before Coursera, I was never able to finish anything on MIT OpenCourseWare. A free flow of information needs too much commitment from my end to be digestible.

It was the structure given by

> "registering" for the class and then following a regimented schedule.

that I managed to start and finish. Disclaimer: I discovered Coursera after grad school.

For those interested in a practical, hands-on course, we just released one at Kaggle https://www.kaggle.com/learn/overview

Berkeley and the UC schools are making major strides in online education, including edX participation and on-campus projects. If you're interested in Berkeley and data science, there's an online masters program too. (Disclosure: Berkeley is in my client roster). https://requestinfo.datascience.berkeley.edu

US$65K for tuition

LOL. As someone from third world where local currency is enormously devalued with respect to US$, I wonder why would anyone do this?

If you think that's bad look up Executive MBA programs. Corporations basically give a university a nearly six figure donation for their top executives to get a tax-free tuition benefit for a rubber-stamped MBA largely indistinguishable from a real MBA.

I know the above probably sounds like sour grapes, I don't have an MBA or any graduate degree, I just think the whole Exec. MBA thing is a total scam against the companies paying for them and a huge cash cow for universities.

Exec MBAs are not for top execs. "exec" is just the marketing tag. It's exactly the same as exec Masters that GATech and UW do to extract tuition reimbursement funds from big companies like Microsoft

These prices seem mostly targeted at employers who offer tuition reimbursement.

Currently in this Data Science Program. Happy to answer any questions anyone has if they're considering it.

direct link: https://www.edx.org/professional-certificate/berkeleyx-found...

(There are two ways you can follow the course: Certificate Program is paid, but the AUDIT program is free)

Note that if you don't want to pay, you should click "View Courses" and click on 1 of the 3 courses in the Foundations of Data Science series.

Pursue the Program ( $357.30 USD - old: $397 )

Has anyone followed the curriculum suggested here?

> http://datasciencemasters.org/

Okay, here's a view of what appears to be part of the course:

We have a course (right, a school application of stuff taught in school!) with two teachers, that is, two sections of the course, each section with its own teacher and its own students. At the end of the two courses, that is, the two sections, we want to compare the teachers. So we give the same test to all of the students from both courses.

Suppose one section had 20 students and the other one, 25 -- the point here is that we don't ask that the two numbers be equal; fine if they are equal, but we're not asking that they be.

So, there were 45 students. So, get a good random number generator and pick 20 students from the 45 and average their scores; also average the scores of the other 25; then take the difference of the two averages.

That was once. It was resampling. Now, do that 1000 times -- remember, we have a computer to do this for us. So, now we have 1000 differences. If you want, then, "live a little" and do that 2000 times. Or, for A students, do all the combinations of 45 students taken 20 at a time. Ah, heck, let's stick closer to being practical and stay with the 1000.

Now, presto, bingo, drum roll please, may I have the envelope with the actual difference in the actual averages of the actual scores in the two classes.

If that actual difference is out in a tail of the empirical distribution of the 1000 differences from the resamplings, then we have a choice to make:

(1) The two teachers did equally well but just by chance in the luck of the draw of the students one of the teachers seemed to do much better than the other one.

(2) The actual difference is so far out in the tail that we don't believe that the two teachers were equally good, reject the hypothesis that there was no difference, called the null hypothesis, and conclude that the teacher with the higher actual average was actually a better teacher.
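For anyone who wants to try it, the resampling procedure above can be sketched in a few lines of Python. This is only a minimal sketch under my own assumptions: the function name `permutation_test`, the fixed seed, the two-sided "at least as extreme" rule, and the made-up scores are all hypothetical choices for illustration, not anything taken from the course.

```python
import random
from statistics import mean


def permutation_test(scores_a, scores_b, n_resamples=1000, seed=0):
    """Two-sample resampling test on the difference of section averages.

    Returns (observed_difference, tail_fraction), where tail_fraction is the
    share of reshuffled differences at least as large in absolute value as
    the actual difference (a two-sided rule, chosen here as an assumption).
    """
    rng = random.Random(seed)  # fixed seed so a run can be repeated
    observed = mean(scores_a) - mean(scores_b)
    pooled = list(scores_a) + list(scores_b)  # all 45 students together
    n_a = len(scores_a)
    extreme = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # picking 20 of the 45 at random
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        if abs(diff) >= abs(observed):
            extreme += 1
    return observed, extreme / n_resamples


if __name__ == "__main__":
    # Hypothetical scores: 20 students in one section, 25 in the other.
    data_rng = random.Random(42)
    section_a = [data_rng.gauss(78, 10) for _ in range(20)]
    section_b = [data_rng.gauss(72, 10) for _ in range(25)]
    diff, tail = permutation_test(section_a, section_b)
    print(f"actual difference: {diff:.2f}, tail fraction: {tail:.3f}")
```

A small tail fraction corresponds to choice (2), a large one to choice (1). Where to draw the line (say 0.05) is a convention, not something the resampling itself decides.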

Sure, it happened that the real reason was that one section of the course started at 7 AM and was over before the sun came up and the other section was at 11 AM when nearly everyone was awake. We like to f'get about such details! Or, sure, we might get criticized for a poorly controlled experiment.

This is also called a statistical hypothesis test or a two sample test. It is a distribution free test because we are making no assumptions about probability distributions of the student scores, etc. Since we are not assuming a probability distribution, we are not assuming a probability distribution with parameters and, thus, have a non-parametric test. Uh, an example of a probability distribution with parameters is the Gaussian where the parameters are mean and standard deviation.

Such tests go way back in statistics for the social sciences, e.g., educational statistics.

In more recent years, leaders in resampling include B. Efron and P. Diaconis, recently both at Stanford.

Why teach such stuff? Well, some parts of computer science are tweaking old multivariate statistics, especially regression analysis, and calling the results machine learning and/or artificial intelligence, putting out a lot of hype and getting a lot of attention, publicity, students, and maybe consulting gigs. Also the newsies get another source of shocking headlines to get eyeballs for the ad revenue -- write about AI and the old "take over the world ploy"!

So, maybe now some profs of applied statistics, what for a while was called mathematical sciences, etc., or other profs of applied math want to get in on the party. Maybe.

What can be done with resampling tests? I don't know that there is any significant market for such: Long ago I generalized such things to a curious multidimensional case and published the results in Information Sciences. The work was a big improvement on what we were doing in AI at IBM's Watson lab for zero day monitoring of high end server farms and networks. Still, I doubt that my paper has ever been applied.

One of the best areas for applied statistics is the testing of medical drugs. Maybe at times resampling plans have been useful there.

I have a conjecture that resampling plans are closely tied to the now classic result in mathematical statistics that order statistics are always sufficient statistics. Sufficient statistics is cute stuff, from the Radon-Nikodym theorem in measure theory and, in particular, from a 1940s paper of Halmos and Savage, then both at the University of Chicago. Some of the interest is that sample mean and sample variance are sufficient for Gaussian distributed data, and that means that, given such data, you can always do just as well in statistics with only the sample mean and sample variance and otherwise just throw away the data. IIRC E. Dynkin, student of Kolmogorov and Gel'fand, long at Cornell, has a paper that this result for the Gaussian is in a sense unstable: If the distribution is only approximately Gaussian, then the sufficiency claim does not hold.
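For the Gaussian claim, a short worked version, assuming the data $x_1,\dots,x_n$ are i.i.d. with mean $\mu$ and variance $\sigma^2$, and writing $\bar{x}$ for the sample mean and $s^2$ for the sample variance (with the $n-1$ divisor):

```latex
\[
f(x_1,\dots,x_n;\mu,\sigma^2)
  = (2\pi\sigma^2)^{-n/2}
    \exp\!\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i-\mu)^2\right),
\qquad
\sum_{i=1}^{n}(x_i-\mu)^2 = (n-1)\,s^2 + n\,(\bar{x}-\mu)^2 .
\]
```

So the joint density depends on the data only through $(\bar{x}, s^2)$, which is the factorization that makes those two numbers sufficient.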

Other applications of resampling and similar applied math might be in US national security. E.g., maybe monitoring activities in North Korea and looking for significant changes ....

Maybe there would be applications in A/B testing in ad targeting, but I wouldn't hold my breath looking for a job offer to do such from a big ad firm.

For all I know, some Wall Street hedge fund or some Chicago commodities fund uses such statistics to look for significant changes in the markets or anomalies that might be exploited. I doubt it, but maybe! Once I showed my work in anomaly detection to some people at Morgan Stanley, back before the 2008 crash of The Big Short, and there was some interest for monitoring their many Sun workstations but no interest for trading!

Net, IMHO for such applied math: If you can find a serious application, that is, a serious problem where such applied math gives a powerful, valuable solution, the first good or much better solution, with a good barrier to entry, and cheap, fast, and easy to bring on-line and monetize, then be a company founder and go for it. But I wouldn't look for venture funding for such a project before it had significant, rapidly growing revenue and no longer needed equity funding!

Otherwise look for job offers (1) in US national security, (2) medical research, (3) wherever else. But don't hold your breath while waiting.

Now you may just have gotten enough from about 1/3rd of the Berkeley course!

What you are describing is known as bootstrapping (if sampling with replacement), jackknifing (if sampling without replacement, classically leaving one observation out at a time), or (in the case you want to run a significance test, and not simply create a distribution or stats like confidence intervals) a permutation test. I think you already know that; I'm just mentioning in case others want to look these up by name. Also, while they can be called 'distribution free', it only means you are not assuming a prefab distribution. If you want to perform a significance test you'll be creating (explicitly or implicitly) a distribution of your calculated statistic (known as the empirical distribution). If you want to be very explicit about this, you can plot a PDF or CDF of your sampled stats just like you could with a Gaussian, exponential, Poisson, etc., distribution.
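To make the bootstrap case concrete, here is a rough sketch with the standard library only. The data, the name `bootstrap_ci`, and the simple percentile interval are my own assumptions for illustration; real analyses often use more refined intervals.

```python
import random
from statistics import mean


def bootstrap_ci(data, stat=mean, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a statistic.

    Each resample draws len(data) values WITH replacement, so the sorted
    list of resampled statistics is the empirical distribution.
    """
    rng = random.Random(seed)
    n = len(data)
    stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_resamples))

    # Read the interval endpoints off the sorted empirical distribution.
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi, stats


# Hypothetical exam scores.
scores = [78, 85, 92, 88, 75, 90, 84, 79]

lo, hi, stats = bootstrap_ci(scores)
print(f"sample mean = {mean(scores):.3f}, 95% bootstrap CI = ({lo:.3f}, {hi:.3f})")
```

The returned `stats` list is the empirical distribution itself, so a histogram or empirical CDF of it is the explicit plot mentioned above.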

We teach these methods to our students in intro stats at UC San Diego. Have been for as long as I've been here (5 years). Last year a data science program was also created here at UCSD. I've TA'd a flagship course in that program too. It's almost exactly the same content; the major difference, imo, is the faculty personalities. The stats profs are smug, while the data science profs are energetically self-important. They teach the same shit. Self-motivated students with a STEMy personality tend to learn more in the stats courses because the profs drive on hard core theory; on average though, students do better in the data science course because the profs are so bombastic the kids walk out of each class thinking they are basically ready to join the fellas over at Waymo on some machine learning projects - maybe even show 'em a thing or two, cutting edge tricks learned back at the ol' uni.


Yup. Thanks.

> known as the empirical distribution

Yup, and I wrote:

"out in a tail of the empirical distribution"

Yup, "rank" tests, "permutation" tests: With my TeX markup:

E.\ L.\ Lehmann, {\it Nonparametrics: Statistical Methods Based on Ranks,\/}

And, yup, again with my TeX markup,

Bradley Efron, {\it The Jackknife, the Bootstrap, and Other Resampling Plans,\/}

Last time I knew, Roger Wets was at UCSD. He read one of my papers and suggested JOTA where I did publish it!

Why is stats full of goofy names to make everything sound more unique and complex than it is?

There's a lot of value in your posts. Mathematizing problems, when successful, brings elegant solutions with well understood properties. Hence, I don't understand the downvotes you are usually getting.

I'm a pure CS / logician by training, but I've spent a few years trying to expand my expertise into probability theory and stochastic processes. Lots of your advice resonates with me. My MSc advisor recommended I should go through Neveu. He was pretty good, had been a student of Pontryagin.

In most academic fields, work that mathematizes the field is regarded as the best.

Neveu is elegant beyond belief. I keep my copy close. I was aimed at Neveu by a star student of E. Cinlar, long at Princeton and before that at Northwestern -- long editor in chief of Mathematics of Operations Research. Neveu was a student of M. Loeve at Berkeley. So was the current darling of machine learning, L. Breiman, because of his Classification and Regression Trees (CART). Breiman's Probability as published by SIAM is generally easier reading than Neveu.

For stochastic processes, there are several relatively different directions to go.

Martingale theory is gorgeous, astounding, amazing, with one of the most powerful inequalities in math, the astounding, tough to believe, martingale convergence theorem, and likely the shortest proof of the strong law of large numbers.

Then can do Markov processes more generally. The discrete state space version is important and not too difficult -- Cinlar has a nice introductory text.

A high end direction for Markov processes is potential theory. There are claims that that is the math for exotic options on Wall Street, but I doubt that there have ever been any applications.

There is a big role for second order stationary stochastic processes in electronic engineering. I ran into that for processing ocean wave data for the US Navy. Here the fast Fourier transform added a lot of interest.

And there's more.

Generally, Russia, France, and Japan seem to have long emphasized stochastic processes more than the US. But by now I suspect that the US is well caught up.

I'd have a tough time believing that very many people with money to hire know enough about high end stochastic processes, or even just Neveu, to hire in those fields. US national security may be about the only hope, that is, outside of academics.

Yes it appears that some of the quantum field theorists in physics are interested in path integrals.

Uh, I'm disorganized here: There is the field of stochastic optimal control!

As usual for advanced applied math, my suggestion is, outside of academics or US national security, find a valuable application and start a business to make money. That is, don't expect to be hired.

Did Neveu ever take a detour into physics? There is a well-known QFT model with his name on it.

Incidentally, I have a copy of Loeve's old probability textbook - I wonder if it is too out of date to be any use.

Loeve is terrific. The stuff there hasn't changed or been much improved on. Some people regard Loeve's book as French that sounds like English or English written as French. But Loeve students Neveu and Breiman are easier starts.

Basically we're talking graduate probability based heavily on measure theory and some on functional analysis.

Last I heard of Neveu he was back in France and at least in part working on probability for stock markets.

There are other more recent authors. It seems that a big fraction of the best researchers sometimes take out a year and write yet another book comparable with Neveu. Personally, I did enough with Loeve, Neveu, Breiman, and Chung and for more attention to such material want to use that background to move on. For now, I'm concentrating on my startup; it's based on some applied math I derived, but I've done that and programmed it. Tonight I got four hard disks installed in my first server. Three of them are partitioned and formatted, and as I type this the fourth is being partitioned and formatted.

If you're referring to Jacques Neveu, his Wikipedia page lists him as having died in May 2016 [1].

[1] https://en.wikipedia.org/wiki/Jacques_Neveu

Thanks for the recommendation on Neveu, I’m going to check it out. If I recall correctly Chung mentioned it in his book as well.

Yes, K. L. Chung's book is also competitive.

you could make that argument even for basic applied math, ... don't expect to be hired ... because the people hiring don't know why they are hiring or what to look for

I loved your post man. I think you are right about the rebranding of Applied Stat as ML|AI. I love stuff like this. I did take 3 Stat courses in university. Currently I am working as a dev. I took a course in my free time, http://codingthematrix.com/, and loved the programming part of it. Do you happen to know some courses where one would have the stat part as well as the programming part?

For this statistics and applied math, at least anywhere near the level of the Berkeley course in the OP, it's by now old stuff, older than nearly all living programmers! Well from various subroutine libraries, some open source, some from, IIRC, the US National Bureau of Standards and Technology, SPSS (Statistical Package for the Social Sciences), SAS (Statistical Analysis System), R, Matlab, Mathematica, LINPACK, CART (Classification and Regression Trees, by L. Breiman and others), and more, there's a LOT of code from quite good up to highly polished. Mostly now people use such code instead of writing it. For stochastic processes, there's code, e.g., the fast Fourier transform for which there is a huge pile of code, for all the different flavors of that curious algorithm.

Well, there is more code to write, but IMHO that would be for relatively advanced techniques or, say, working with terabytes of data instead of megabytes.

If you want to write code for applied statistics, then maybe so indicate, have a portfolio of code, and contact the usual suspects -- US national security and medical research. I'm not optimistic. I've given my opinion -- find a good application and found a startup to monetize it.

It is true that today there is a WSJ article on how technical, with algorithms for trading, Wall Street has become. The article has next to nothing on what applied math is being used but does have lots of names, maybe some you could contact. Actually, the article mentions that Goldman Sachs (GS) got hot on such applied math. Well, that was about when I wrote Fisher Black, of Black-Scholes, there at GS asking about applied math at GS, and I got back a nice letter from Black saying that he saw no such opportunities. Well, the WSJ article today claims that that time was when GS was getting hot on applied math.

If you want to know about applied math on Wall Street, then try to get an opinion or overview from, say, James Simons.

Again, IMHO, it's academics, US national security, medical research, maybe a few other situations, but best of all, start a business, the money making kind.

At my previous job we used some similar techniques in (1) social media monitoring and in (2) cyber security applications. In (1) we had an applied math team working on it, and in (2) I handed the project over to someone doing a math PhD.

To be fair, resampling wasn't the key to our projects, but we were doing a lot of work understanding probability distributions which is not entirely unrelated.

I love this. Thanks a lot for taking the time to write it!

Can you or someone else TL;DR that for us ADHD folks?

What I wrote is much shorter than the Berkeley course.

The idea of resampling is just dirt simple; read it again, just the first paragraphs, and you'll get it.

why is there no syllabus - as in a list of contents? I want to know what really is behind this buzzword stuff.

It's probably around the same content as this semester's iteration: http://data8.org/sp18/. I would read the online book for more info.

There are at least two big turn-offs to this course at first blush: 1) they insist on using anaconda (effectively another package manager complicating the already layered interaction of system pip, virtualenv, virtualenvwrapper etc ). 2) they use Microsoft VisualStudioCode (so, inevitably a good deal of time in this course will be spent learning how to navigate a bloated IDE)

Neither of which are the least bit consequential for anyone more interested in learning about data science.

As it turns out I was wrong about one of those points: in fact the course prefers that you avoid MSVisualStudioCode and instead use the Jupyter notebooks.

But this brings us back to a much more central topic in data science: the tools and environment DO matter. Hugely.

Reproducibility is central to not just data science but all science. This is facilitated by the use of Free, Open platforms which adhere to common standards.

Imagine trying to debug why someone has a different answer than you when there are x*variant-of-program environments in which they have obtained their answer?

At the most basic level this course should be distributing a Docker image or a VM image of some sort in order to ensure that everyone has the same version of the software.

Even if you do not care about any of the above, please, shed a tear for the student who would like a simple setup.

Thank you.

What exactly is Data Science? It seems like such an overused term and the value of the subject really gets diluted for me when I see charts in Tableau being offered as examples of "data science".

What's the difference between, say, a Master's program in Computer Science where one studies machine learning and a Master's program in Data Science? Am I wrong for thinking the Data Science program weaker?

> What exactly is Data Science?

Data Science and DevOps are both just labels for things people have been doing under more mundane terms for 40-odd years.

Even Machine Learning is just a trendy buzzword for what used to be called Predictive Statistics.

I did stats before data science was a thing, and then ran a data science team afterwards, and it's dramatically different.

I've never seen any stats text book or course discuss techniques for dealing with large amounts of data to any significant level, but in data science that is a core part of what you do.

I ran production systems before DevOps and after. Again, it's very different - prior to devops, there was no emphasis at all about using software engineering techniques to manage and deploy software. The most you'd get was some scripts maybe kept in source control if you were lucky.

Now I run an AI company, and a key part of the ML we use involves generating structured text files from images. I guess predictive statistics is technically a correct label, but the tools and techniques are so dramatically different that thinking of them as separate fields is more correct than incorrect.

I struggled for years to understand DevOps because I couldn’t see what was different from how I worked already... the answer was nothing :-)

DevOps gave a name to what you were doing, which other sysadmins and product devs were not. Is that bad?

The CS program should focus more on data structures and algorithms (and possibly UX and good ol' software dev as well) and the DS program should focus more on statistical/analytical methods and their particular nuances and limitations. If the DS program is done well, with a lot of stats classes, then it is not a weaker program.

Honestly, with the current state of stats teaching, they might be better off just avoiding the stats classes in most cases. I don't need yet another drone that tries to convince me I should make a business decision because of "statistical significance" but can't explain why I should care about statistical significance (p.s. I shouldn't).

Berkeley also used to have this Data Science with Spark series on edX but they taught it just the one time and now even the archived versions of the courses are closed.


I'm so sad this was never taught again. It's the most useful MOOC I've taken, and it motivated me to start using pyspark on a daily basis. I would say the class is better for learning pyspark than actual data science concepts though.

For those interested, you might want to check out http://data8.org I'm not sure how it compares to the OP course, though.

Does anyone know how it compares to bootcamps like DataCamp[0], for example?

[0] https://www.datacamp.com

Boot camps are stupid.

Go to community college. It's ridiculously cheap, and the credits are worth something.

How good a community college is often dependent on how wealthy its county is.

Secondly, a data science course may be offered, but only during fall semester and everyone else wants to sign up for it.

Also bootcamps can compress the curriculum from two years worth of junior college into 8-12 weeks. For someone who never enjoyed being in school, I'll take the bootcamp.

"Also bootcamps can compress the curriculum from two years worth of junior college into 8-12 weeks"

Right, and I got a doctorate by listening to an audio tape.

Except no one from a bootcamp is adding "BSc Computer Science" behind their name. And somehow they're getting decent jobs.

If you can learn skills that get you a job from a boot camp they’re not stupid. The fact that Lambda School and App Academy don’t get paid unless you get a job and they still exist suggests rather strongly that they get people jobs.

Skills without the foundations are going to be useless in a technology shift or economic downturn. Also getting a job is great - how about keeping one, or advancing?

It seems very short sighted; community college only takes 2 years. Your career lifetime is what, 45 years? The ROI is insane, why would you shortchange yourself?

I think you radically underestimate how useful work experience is, both in terms of what you learn at work, and in terms of actually having money, rather than not having any. Technology shifts happen all the time. Being able to keep up is a necessary skill but it’s a great deal easier to learn the latest JS framework if you know another one. Given the large number of degreeless programmers and the fact that CS graduates are a minority of working programmers I think we can take it for granted that neither are necessary.

The only boot camp grad I’ve spoken to personally had a degree in Human Genetics, did App Academy after deciding they didn’t want to do it as a career and got a Django job out of it, despite having no knowledge of Python. They had learned enough Ruby on Rails in three months of more than full time work to impress in an interview.

Community college is only two years? That’s half the time needed for an actual Bachelor’s degree, but radically less valuable than one for getting a job. There are two sensible reasons to get an A.A. or A.S., to get a job afterwards or to get the Bachelor’s that comes after. You need to pay for it but more importantly you can’t get a real job during it and you need to eat and live during it.

Even if a good boot camp is strictly inferior to a median A.S. in Computer Science the first can still be a better choice purely because it takes less time. Having known people with Bachelor’s in CS who can’t code I doubt an Associate’s is better.

Who would you hire? The boot camper with two and a half years work experience or the A.S. graduate with one? What if neither of them has a B.A. to go with it? What if both do?

I've met a number of people with university degrees (not community college) who later on go through a boot camp. They can be a great way to change direction.

App Academy takes $5000, last I heard. The hype is that they take nothing, though.

What a ludicrous statement. Name a single community college program that touches on even a small portion of something like www.freecodecamp.com

Many if not all of the courses on edX have a free audit option, like this one. It gives you no certificate, and often you cannot access or submit exercises.

Who has time for this?

Unemployed desperate people thinking that this will get them a job.
