Hacker News new | past | comments | ask | show | jobs | submit login
How Khan Academy is using machine learning to assess student mastery (david-hu.com)
224 points by divad12 on Nov 2, 2011 | hide | past | favorite | 46 comments

You should look at liblinear and Vowpal Wabbit. The former gives you a super-fast regularized logistic regression (and many other things). The later gives you online classification, and you can fake a probabilistic model.

On a related note, you're wasting clicks using A/B testing. I emailed you guys about using a better online method (a bandit algorithm) but never heard back from anyone. If that's of interest, drop me a line (noel at untyped dot com).

Update It's occurred to me that you're using GAE, and so probably can't run C libraries like the above two projects. There is a Java library here: http://code.google.com/p/boxer-bayesian-regression/ If you're going to do per user and per exercise models you'll have many fewer data points to train your models on. You should consider sharing data between models or use a model that will give some measure of uncertainty in it's predictions. The Bayesian LR code I referenced above will give some measure of uncertainty. There is a stack of (really interesting) work on other methods that will also do this.

I have nothing to do with KA but I am curious by what you mean about A/B testing and bandit algorithms. Would you mind elaborating or sharing a link or two?

(Reads links.) I've been going around telling people for a while that A/B testing is non-Bayesian but I didn't realize there was an off-the-shelf solution! You need to pimp your wares more often.

It doesn't sound terribly Bayesian to me. From one of his pages:

However, the significance calculation makes a critical assumption that you have probably violated without even realizing it: that the sample size was fixed in advance. If instead of deciding ahead of time, “this experiment will collect exactly 1,000 observations,” you say, “we’ll run it until we see a significant difference,” all the reported significance levels become meaningless.

The original intent of the experimenter shouldn't affect his conclusions!

With that said, I'm not sure what the proper Bayesian answer is. If I were to stop my experiments every time that P(X|D) < 0.05 (or whatever), wouldn't that introduce bias? [1]

Incidentally, you (Eliezer) had a nice post where you explained the problem in detail (something about two doctors doing an experiment 50 times with different "stopping rules", who get the same data), but I couldn't find it.

[1] EDIT: On further thought, there wouldn't be any bias if I do the correct thing and report my uncertainty: P(X|D) = 0.04. The "repeated significance testing errors" comes from converting P = 0.04 into an absolute statement "X is true". Then the stopping rule will affect the statistics.

Changing stopping rules after seeing the data creates bias/distortions. One does have to set the rules in advance, that's not a mistake.

Consider a data set consisting of a single license plate number. (This example is from Richard Feynman.)

If you set the rules in retrospect, you can go "Wow, what were the odds my one license plate would be XJKDL-2342-KE? One in a million?" But that's wrong.

On the other hand if you predicted XJKDL-2342-KE in advance, then the same data point would have a different meaning. How did you predict it?

Patterns that you can predict in advance are different from ones you can find in retrospect after looking through whatever results you get. So the same data point -- XJKDL-2342-KE -- can take on different meaning depending on the original intent and design of the experimenter.

People make this mistake all the time with more mundane examples. Like they will roll snake eyes three times in a row, then calculate the odds of that happening, and then says "wow 1/6^6, there was such a minuscule chance i'd get screwed like this". but they're just wrong. ANY exact ordering of the 6 individual dice roles has a 1/6^6 chance of happening, and you have to get one of the "unlikely" results.

To help make this more intuitive, consider that they would have been surprised by rolling all 2s, or all 3s, or various other patterns. So you at least would have to figure out how many outcomes they'd deem surprising and figure out what proportion of the possibilities are in that category. And then take into account all the rolls they made when they weren't surprised and didn't record any data...

Thank you for your comments - there's a lot of issues about this problem that I'm not entirely comfortable with.

With that said, I'm not sure that I see the connection between what you're arguing and the significance problem in the original. What do you think of the example with the two doctors? http://lesswrong.com/lw/mt/beautiful_probability/

Changing stopping rules after seeing the data creates bias/distortions

We're talking about a fixed stopping rule, which depends on the data.

Oh, I see what you mean about fixed. It's hard to generalize about all data-dependent stopping rules. I do think the one that says "If I get one conclusion, stop. If I get the other, keep trying," has got to be a bad idea! It prevents you from possibly finishing an experiment that concludes you're wrong. The possibility of not terminating seems especially problematic. But if the rule was more like "Roll 10 dice. If at least one came up 6, roll one more die." then it might be harmless and just complicate analyzing the results.

I think that even if you get the same data set as your results, as in the linked doctor example, the stopping rule does matter. A reason for this is that it affects the repeatability of experiments. Experiments ought to be repeatable (within margins of error, and using all the same procedures including stopping rules). Let's consider what could happen with repeat trials. Suppose the medicine almost but not quite works according to the standard you're trying to test for that will make the medicine considered a success (cures at least 60 out of 100 people, on average -- not the most realistic standard, but that's not important). Also suppose the data set had 60 cures exactly, not 70 (the following is technically possible, but unlikely, with 70).

So, the first experiment with a fixed N=100 will not be repeatable. Slightly too few people will be cured in future trials. The success on the first attempt was good luck within the margin for error (the real average cure rate is 59 out of 100).

The second experiment, however, will eventually report that the medicine works on at least 60% of patients on average in all 10 (or whatever) repeat trials, even though this is false. (I think. Maybe they won't make that false "on average" claim but will conclude something different instead? What?) The reason this will happen is basically the same reason that if you flip a coin enough you random walk away from the average (and eventually visit both sides of the average). And because the real cure proportion is so near the goal, there's a pretty good chance you could do a lot of repeat trials without any stalling out for years (which might clue some people in to the problem, though if they strictly ignore data from studies that haven't stopped yet, then maybe it wouldn't).

Similarly, imagine a study of coins which had a stopping rule to stop whenever you have at least 60% heads. You'll always be able to get that result and conclude the coin is biased, even if all coins used are fair. You'll often be able to get that result pretty quickly (and actually if you don't get it quickly, but hover around average, the expected time to get it will keep getting worse. But I bet we could come up with an example that doesn't have that property. Or we could consider 20 research groups, 15 of which report coins are biased and 5 of which never publish.). But the point is their result, claiming coins are biased, may be wrong. Even the possibility of the method getting a wrong answer, without anyone having made a mistake in doing it, is a major problem!

If someone said, "Never mind their stopping rules, I want to salvage their coin flipping data and use it for my other project" I think they would have a serious problem because it's not a proper random set of coin flip data but is instead limited to various possible sequences of flips and not others.

Now it could always be that trials with bad stopping rules get lucky and are correct, and using or believing their data won't work out badly. Their data set could happen to be identical to one that is properly collected. But I think one always has to fear the possibility that they didn't get lucky and their stopping conditions have spoiled the data (especially when you don't have a properly done trial with identical results to compare with) just as the people trying to prove coins are biased could easily spoil their data using fixed but unreasonable stopping conditions.

It's getting too late for me to think about statistics :) A few points:

Similarly, imagine a study of coins which had a stopping rule to stop whenever you have at least 60% heads. You'll always be able to get that result and conclude the coin is biased, even if all coins used are fair.

This is not true. Because of the law of large numbers, the probability of ever reaching the 60% decreases with time.

I do think the one that says "If I get one conclusion, stop. If I get the other, keep trying," has got to be a bad idea!

I understand what you're talking about. I see the potential for a problem. But my understanding is that Bayesian statistics isn't subject to that.

Proper Bayesian result reporting doesn't say "We believe that the coin is biased". We would rather say "The probability that this coin is biased is 60%, subject to our assumptions and model".

My feeling is that this statement is true:

If the model and assumptions are correct, then the Bayesian outcome will be true regardless of the stopping rule.

In this case: 60% of coins for which the Bayesian analyst proclaims P(biased)=0.6 WILL be biased (barring sampling variations). The stopping rule doesn't matter.

I'll try and figure out a solid explanation by tomorrow.

FYI I edited my post to mention the issue about the coins (your first point) shortly after submitting it. I'm guessing you read the non-edited version.

> Proper Bayesian result reporting doesn't say "We believe that the coin is biased". We would rather say "The probability that this coin is biased is 60%, subject to our assumptions and model".

I'm not really sure what you're getting at here. None of the coins are biased, by premise, so they shouldn't be concluding either thing.

If you throw in "if our model and assumptions are right" then you can shift the blame (if they assumed their stopping rule was OK, or came up with a model that says it's OK). But I'm not sure how that substantively helps.

Will check back tomorrow for further comments from you.


I've been thinking about the problem a lot today. I'm pretty sure that my point is basically right, if the model is correct, but my ideas are not clear enough to explain it properly. Model correctness in Bayesian statistics is a complicated problem, and as far as I can tell, it's not a completely solved one. Bayesians usually agree about their calculations, but there's heavy debate about the "philosophy".

In any case, maybe you'll find Eliezer's other post insightful:


I really hope to figure out model correctness, and this optional stopping problem looks a good vector of attack.

Thank you for the discussion, and sorry for leaving you hanging!


(if there's any Bayesian out there willing to continue the discussion, my email is in my profile)

I was thinking it through more and I think it's the stopping procedures that might not halt that are the problem. You can have a data dependent stopping procedure if it's guaranteed to halt which makes sure all data does get counted.

For example of one that might seem bad, but does halt, and turns out to be OK:

Flip a coin until you have more heads than tails OR reach 500 flips.

This procedure will produce a majority of trials with more heads than tails, but I think the average over many trials will be 50/50. The conceptual reason is that stopping early sometimes prevents just as many heads as tails that would have come up after stopping. I haven't formally proved this but I did a simulation with a million trials with that stopping procedure and got a ratio of 1.0004 heads per tails which seems fine (and after some reruns, I saw a result under 1, so that is possible). Code here:


With a guaranteed halt, a sequence of 500 tails and 0 heads can be counted. With no guaranteed halt, it's impossible to count a tails heavy sequence, which is not OK because it's basically ignoring data people don't like.

Does that make sense? I think it may satisfy the stuff you/Bayesians/Eliezer are concerned with. It means it's OK to stop collecting data early if you want, but you do need some rules to make sure your all your results are reported with no selectivity there.

There's also a further issue that these kinds of stopping procedures are not a very good idea. The reason is that while they are OK with unlimited data, they can be misleading with small data sets. It's like the guy who bets a dollar, and if he loses he bets two dollars, and if he loses again he bets 4 dollars (repeated up to a maximum bet of 1024 dollars). His expectation value in the long run is not changed by his behavior but he does affect his short term odds: he's creating an above 50% chance of a small win and an under 50% chance of a larger loss. If you only do 10 trials of this betting system, they might all come out wins, and you've raised the odds of getting that result despite leaving the long term expectation value alone. Doing essentially the same thing with scientific data is unwise.

BTW/FYI I believe I have no objections to the Bayesian approach to probability but I do think the attempt to make it into an epistemology is mistaken (e.g. because it cannot address purely philosophical issues where there's no data to use, so it fails to solve the general problem in epistemology of how knowledge (of all types) is created.)

You're stuck in frequentist thinking. "Bias" is a property of repeated sampling -- the expectation over repeated samples. But we just have one! The relevant question is what is your best guess for p, the probability the coin will be heads.

Under a uniform prior [0, 1] the posterior mean is the empirical mean. How you sample is of no consequence. The likelihood/posterior f(p|#heads, #tails) is p^(#heads)(1-p)^(#tails) regardless of how you sample. Differentiate with respect to p and you get p*=heads/total.

It is rather amusing that most statistics professors are happy to have taught their students that the sampling procedures matter while at he same time crushing the natural intuition that your decisions should be based on the data you observe not on what might have happened in a world that doesn't exist.


Consider an infinite string of coin flips. Now consider a subset selected by a stopping rule to meet a particular criterion. And a different subset chosen with an N=100 criterion. The first stopping rule creates a bias: you have a non-random sample chosen to meet that criterion. The second stopping rule doesn't do that, it gets what we call a "random sample".

If someone then takes your dataset and assumes it's a random sample -- e.g. just the same as the N=100 doctor trial -- he's wrong. It's not, it's something else, and that something else is less useful.

You say "how you sample is of no consequence". But suppose your sampling method selectively throws out some data that it doesn't like. That is of consequence, right? So sampling methods do matter. Now consider a method which implicitly throws out data because some sample collections are never completed. That matters too.

Yes, clearly. I stated that too strongly. Sampling procedures can definitely matter enormously, but stopping rules are within a class of ignorable rules. The link above gives a more precise definition.

I think that you are mostly right about halting (guaranteed) stopping rules.

See my other comment, up a few times then down the other branch, the one with the pastebin code.

However the example with the two doctors was not the halting type.

Can you agree to that? Or do you have a defense of non-halting stopping rules, even though they are incapable of reporting some data sets?

I think I figured this out but would be interested in criticism on this point if not. Is there some way of dealing with non-halting that makes it OK?

The book says if there's a stopping rule then inferences must depend only on the resulting sample but that assumes there is a resulting sample -- that the procedure halts.

Off-topic, but what happened to statsia? I was curious to see what you were working on.

Website is down but the project continues. Email beta@statsia.com if you want to be put on our insider's list.

Changing stopping rules is perfectly fine in a Bayesian setup. It's the likelihood principle. This is one of the central differences that arises when you condition on the data (Bayesian) rather than the parameter (frequentist).

Ced, the part about the significance calculation is a problem with the A/B test, not a problem with the Bayesian bandit version.

The post you're looking for is http://lesswrong.com/lw/1gc/frequentist_statistics_are_frequ...

Nah, it's not. The original had a setup like:

Doctor A decides to test a cure on 50 patients. 40 have gotten better. Doctor B independently decides to test the same cure on his patients. He will stop once he has reached 'significance'. Coincidentally, the results become significant at the 50th patient, and he also has a 40/50 success rate.

Doctor A says "I followed a fixed testing procedure, and the statistical analysis says that my data is not significant. We need more experiments."

Doctor B says "I followed an optional stopping procedure, and the statistical analysis says that my data is not significant: the cure is good."

A Bayesian would claim that if they both have the same data, then they should reach the same conclusion, regardless of their intent.

A "frequentist" would uphold that the doctors can legitimately disagree. I don't know much about frequentism, but it's the dominant perspective in statistics. Everything I've read about A/B testing is frequentist.

Yes! Thank you.

It's interesting how simple their feature set is. I imagine the two EMAs and the percent_correct are probably the most important inputs. A few other interesting ones might be the percent correct for say the past 10 or 20 questions (instead of explicitly cutting off the last 20 questions as they mention). They may also want to pick a different response, like % of next 10 the user gets right instead of probability of getting the next question right.

Wow, this is literally the exact application of the Stanford Machine Learning class up to this point.

I was thinking the exact same thing. It was really neat to read about a real world use of what I'm learning in the class.

It makes me wonder if the data set is available as well.

the professor did say people make careers off of what we already know with linear and logistic regression.

I loved how giddy he was in the class videos when he stated that.

Good stuff, but the thought occurs to me that what you really want to know is when the user has stopped learning - i.e., when proficiency stops increasing as a result of doing more problems - and how much each individual problem increases proficiency. But that would undoubtedly be more complicated.

Another interesting mechanism is what ChessTempo.com uses. Players and puzzles both have a Elo-like dynamic ranking. When a player beats a puzzle, the player's rank increases and the puzzle's rank decreases. This makes it easy to, over time, present players with puzzles that are of appropriate difficulty.

Another ingenious approach is taken by chesstempo.com, a chess training site. Just as in chess itself the ratings of players are determined by pairwise comparisons (games between players), they pair players up against problems. If they solve the problem, the rating of the problem goes down, the rating of the player goes up. Players are given problems close to their ratings, which keeps everyone happy. I believe they use Glicko to track uncertainty in the rating.

Chapter 22 of David Barber's "Bayesian Reasoning and Machine Learning" (he makes it available online) does a nice (perhaps brief) job of explaining the progression through the Rasch model, the Bradley-Terry-Luce model and Elo.

As an aside, the way they chesstempo generate the exercises is also cute. The tactical chess problems are positions taken from high level (human) games fed into a chess engine which identifies blunderous moves where there is a single distinctly best way to respond. The challenge is to find that best move. Because they are taken from real games, they have the appearance and feel of real positions, which is important; many people believe pattern recognition is an important part of chess mastery. Apparently they've built up nearly 40000 such tactical exercises.

How awful, I just saw [ja27 17 hours ago]. I've even managed to describe it in almost exactly the same way..

I think we should note that logistic regression has been around forever, and should probably be considered property of "statistics", not "machine learning".

Iteratively solving for the model parameters using gradient descent is not standard practice in a statistics class; paying attention to the numerical methods behind logistic regression is a very CS kind of thing to do.

this is literally the first thing taught in both of the machine learning classes I've taken from Prof. Ng at Stanford, so maybe it's the application of logistic regression more than the estimation technique itself that makes machine learning?

I think it's more that machine learning builds off of earlier work in statistics. If I remember correctly, that class discusses lots of other things that are used in ML but were invented elsewhere (gradient descent, maximum likelihood, matrix methods, etc.)

Arguably this is all just semantics (nature knows no stats/ML divide), but as a ML person I know this drives statistics folks crazy.

There used to be an earlier era of machine learning that wasn't as statistical. Ng, and most other current ML researchers, now heavily draw on mainstream statistics. It really does make sense to do logistic regression as the foundation for later stuff.

The terminology confusions, I think, stems from the earlier era of ML research.

What did the earlier era use?

In a case like this, perhaps a perceptron?

Which is Ng's class's next chapter after logistic regression. :-)

>This was a fairly large change that we, understandably, only wanted to deploy to a small subset of users. This was facilitated by Bengineer Kamen's GAE/Bingo split-testing framework for App Engine.

I think this method of A/B testing has some faults. I blogged about it A/B testing. Is Khan doing it wrong?http://liveatthewitchtrials.blogspot.com/2011/09/ab-testing-...

and Allen Downey ran some simulations at Repeated tests: how bad can it be?http://allendowney.blogspot.com/2011/10/repeated-tests-how-b...

My question is how does time dependency work in this case. I am trying to wrap my head around how a prediction engine would work when your assessing students on the basis of not just past/current performance but also how much time their taking while answering each question.

I think you can model for randomness (kids getting lucky while answering a question), but if you can somehow add time-dependency to the model, then your predictability would be higher (of course this is pure speculation).

Does anyone have a good model I can look at? Any help would be appreciated.

Time-to-answer is a numeric, but perhaps not linear, predictor.

To solve the problem that people dont do more problems after becoming proficient, consider forcing a randomized subset to solve one extra problem for aquiring proficiency. Don't tell the users when this happens though, just show the bar as not quite full

How soon until machines will be teaching us and grading us?

Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact