

Big Data at Khan Academy - dylanvee
http://mattfaus.com/2013/10/big-data-at-khan-academy/

======
t1m
It's interesting work, but it's not really 'big data'.

"Every day, we collect around 8 million data points on exercise and video
interactions, and a few million more around community discussion, computer
science programs, and registrations. Not to mention the raw web request logs,
and some client-side events we send to MixPanel."

OK - 8 million records per day. Let's double that for the argument's sake.

Even if they were fairly fat records (1 KB), that's only 16 GB/day. That works
out to around 2 months per TB.

I can easily put together a machine with 20TB of storage and run a traditional
free relational DB (or even a single free node of Greenplum) and store more
than 3 years of this data.

Then bang against it with SQL. Transactions are free.
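For anyone who wants to check the back-of-envelope numbers above, here is a minimal Python sketch. It assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes) and the doubled record count; the function names are just for illustration.

```python
# Back-of-envelope storage estimate using the figures from this comment.
RECORDS_PER_DAY = 8_000_000 * 2   # 8 million, doubled for the argument's sake
BYTES_PER_RECORD = 1_000          # "fairly fat" 1 KB records (decimal units assumed)
DISK_BYTES = 20 * 10**12          # one machine with 20 TB of storage


def daily_bytes(records=RECORDS_PER_DAY, record_size=BYTES_PER_RECORD):
    """Bytes of new data written per day."""
    return records * record_size


def days_to_fill(capacity_bytes, per_day=None):
    """Days until a disk of the given capacity is full."""
    per_day = daily_bytes() if per_day is None else per_day
    return capacity_bytes / per_day


if __name__ == "__main__":
    print(f"{daily_bytes() / 10**9:.0f} GB/day")
    print(f"{days_to_fill(10**12):.1f} days per TB")
    print(f"{days_to_fill(DISK_BYTES) / 365:.1f} years on 20 TB")
```

This ignores indexes, replication, and the raw web request logs, so treat it as a lower bound on real storage needs.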

~~~
yeukhon
Big data doesn't mean you need 100TB per month. It simply means you have so
much data that you cannot just read through all of it and analyze it without
more robust methods of computation. And 8 million records per day is a lot.

The real question is: out of the records they have collected, how much useful
data can they extract, and what exactly can they get out of that data set
besides just who visited from where, etc.?

------
nl
Interesting.

There's a whole emerging field called "learning analytics", which at the
moment appears to be more a theoretically good idea than anything with
practical outcomes. (Sadly, much in education is like this: something will
emerge in the technology field, and then 6 months later there will be an
XXX-in-education movement.) Khan Academy, though, is in a good position to get
that data and use it.

But for those of you who have kids who do Kumon Math (or similar) it's pretty
easy to see how analytics could speed up the Kumon process (of selecting
questions that exercise very specific skills).

For those interested there is an upcoming "Big Data in Education" Coursera
course[1] that I'm planning on doing. It will be my first Coursera experience,
so I'm not quite sure what to expect. I'm in the fortunate position of having
access to a fairly significant amount of educational usage data, so I'm hoping
it will be useful.

[1] [https://www.coursera.org/course/bigdata-edu](https://www.coursera.org/course/bigdata-edu)

~~~
ZeroGravitas
Only 6 months later? I'd pegged it around 3-5 years.

Your Kumon math suggestion is exactly what Khan Academy is doing as its
first big use of this stuff. See the links I posted elsewhere for more info
from them.

I think the big difference is that Khan Academy is building their whole
approach around the analytics, rather than vice versa. There's plenty of
"free" info you can collect from web or online courses or question banks, but
I've not seen any real concept for feeding that back into the course
construction.

For example, at one "learning analytics" thing I attended it was discovered
that the really big indicator of whether a student was going to pass or fail
was whether they did well on exams. This fact though was buried so deep behind
a super simple traffic light dashboard that no user would have ever been able
to figure it out and do something useful with that info, like for example
change the course to have more and earlier testing, which is easier than ever
with modern technology.

For whatever reason, just testing students' knowledge isn't considered quite
as exciting as trying to predict their success by applying dizzyingly complex
math to the trail they leave on the web.

------
alexatkeplar
Isn't this a flawed approach? It seems like Khan Academy is trying to
reconstruct a record of behaviours across their business by stitching together:

1. Parsing web logs for web page views and API accesses

2. Exporting "some client-side events" from MixPanel

3. Mining their transactional databases for state changes

On #1 - web caching and client-side events have long invalidated web log based
analytics approaches. How is Khan different?

On #3 - this is reverse engineering your user behaviours by mining state
changes in your transactional systems. This is typically a ton of work, it
breaks when you change your data models, and your operational systems aren't
designed to reveal user behaviours anyway.

Has Khan explored alternative approaches? Typically: defining with the
analyst team a set of events you want to monitor, making sure all of your
systems (client-side, mobile, server-side, whatever) emit immutable streams of
these events, and then collecting, storing, enriching, analyzing at your
leisure.
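To make that alternative concrete, here is a rough Python sketch of the idea, not anything Khan Academy is known to run. The event names, the `Event` fields, and the `emit`/`load` helpers are all hypothetical; the point is only that every system writes the same immutable event shape to an append-only sink, and analysis reads it back later.

```python
import io
import json
import time
import uuid
from dataclasses import dataclass, asdict, field

# Hypothetical event vocabulary, agreed up front with the analyst team.
ALLOWED_EVENTS = {"video_played", "exercise_answered", "user_registered"}


@dataclass(frozen=True)  # frozen: fields cannot be reassigned after creation
class Event:
    name: str
    user_id: str
    source: str  # which system emitted it, e.g. "web", "mobile", "server"
    properties: dict = field(default_factory=dict)
    event_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)


def emit(event, sink):
    """Append one event to an append-only sink as a single JSON line."""
    if event.name not in ALLOWED_EVENTS:
        raise ValueError(f"unknown event type: {event.name}")
    sink.write(json.dumps(asdict(event), sort_keys=True) + "\n")


def load(sink_text):
    """Read events back for enrichment or analysis at your leisure."""
    return [json.loads(line) for line in sink_text.splitlines() if line]


if __name__ == "__main__":
    log = io.StringIO()  # stand-in for a log file or message queue
    emit(Event("video_played", "u1", "web", {"video": "intro"}), log)
    emit(Event("exercise_answered", "u1", "mobile", {"correct": True}), log)
    for row in load(log.getvalue()):
        print(row["source"], row["name"], row["properties"])
```

Because the events carry their own meaning, the analysis side no longer depends on the shape of the transactional data model, which is the breakage described under #3.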

------
noelwelsh
This was a nice read but I'm much more interested to know what they do with
the data. From hanging around "big data" people the emphasis still seems to be
on storage and simple SQL-esque querying. For most people this is a solved
problem, and it's time to go beyond storage and see what value we can get from
data. I believe in most cases this requires a different skill set _and_
different mindset. Most people think in binary terms, but statistical models
deal with shades of grey -- nothing is ever certain -- and even simple models
like linear regression are difficult for the untrained to understand.

~~~
ZeroGravitas
A couple of older posts about what they're doing:

How Khan Academy is using Machine Learning to Assess Student Mastery

[http://david-hu.com/2011/11/02/how-khan-academy-is-using-mac...](http://david-hu.com/2011/11/02/how-khan-academy-is-using-machine-learning-to-assess-student-mastery.html)

Khan Academy: Machine Learning → Measurable Learning

[http://derandomized.com/post/51729670543/khan-academy-machin...](http://derandomized.com/post/51729670543/khan-academy-machine-learning-measurable-learning)

I think the latest Dashboard design that they've just (soft-?) launched is
based on this work. When you first start it asks you about 10 questions and
based on your performance on that short "Maths pre-test" guesses what you know
and what you don't across the whole of Maths.

------
scorpion032
Do you know anybody else that uses Google App Engine?

