Amazing work. The collective intelligence of the world has just gone up.
Another amazing example of visual math lectures, https://www.youtube.com/channel/UCYO_jab_esuFRV4b17AJtAw.
Honestly, I find that the Markov chain website is an example of how you should NOT do it. You are replacing the active intellectual process in the student's head with trivial animations, scrolling text, and jumping tokens that look nice and dynamic, but... what are you learning?
Of course, this doesn't mean that visualizations are useless, but they should be carefully designed with some learning goal in mind. Not easy! For example, the visualization could slowly gray out states that are not visited. In that way, the student could get a feeling for concepts like recurrence etc.
This requires a learner to parse a procedure correctly AND understand how it relates to the topic. That is doubly hard on the student and doubly likely to produce a mistake (in the procedure, the interpretation, or both). An approach like the one in the article roughly halves the work a learner must do and increases the likelihood that the topic is interpreted correctly.
Retention with this method, however, is a whole other matter: one that does correlate with the amount of mental effort required, though how strongly is a matter of debate.
> The visual explanation of the Markov chains on that web site looks elegant and simple because... it is
I strongly resonate with this statement. However, I disagree with the claim that "you are replacing the active intellectual process". In fact, it is the opposite: it is stimulating.
Let me present an anecdote that will perhaps resonate with many people. Oftentimes, when venturing into a new topic, one tends to read the reference texts. Let us be frank: reference texts are not beginner-friendly, no matter how much they claim to be (they are references for a reason). To get to the meat of the story, a reader has to hold a lot of context from past knowledge. This is hard, and by the time the actual results are within reach, the reader is often already exhausted. Intuitive visualizations, by contrast, provide context for a new reader. The next time the reader goes back to the reference text, this "lookahead" (from the visualization) is valuable. It is easier to internalize where a particular topic lies in the grand scheme of things.
> Of course, this doesn't mean that visualizations are useless, but they should be carefully designed with some learning goal in mind. Not easy! For example, the visualization could slowly gray out states that are not visited. In that way, the student could get a feeling for concepts like recurrence etc.
This I believe is nitpicking.
Or your intellect can be "activated" in the sense that further engaging it leads you to deeper, more general knowledge. Is this website doing it? I don't believe so.
Sometimes the material is important enough that you want to make it as easy to understand as possible. Sometimes the material is a chance to teach your students how to problem solve. Or how to teach themselves new material. Or to help them become mathematically literate (i.e., able to read a book or paper on their own without your help).
A good course/curriculum does both. Some definitions are carefully explained at an intuitive level so that students can struggle with things beyond the definition. Sometimes the definition is an opportunity to teach students how to go from a formal definition to an intuitive understanding. Or to come up with extensions to the definition on their own. Or to explore the implications of a definition on their own.
In short, are you teaching the definition of a Markov Chain as key content knowledge, or is this being presented as an exercise in mathematical creativity and mathematical problem solving? There is no "right" answer without more context; a good course will do the former with some topics and the latter with others.
So this is not a topic-level question. It's a course-level or curriculum-level question.
Interestingly, misunderstanding the fundamental tension between "explain well" and "teach people how to learn for themselves" is the source of many public debates on education. E.g., Common Core mathematics: are we teaching students how to add numbers or how to problem-solve? Well, ideally both... but the debates always become an either/or.
You know, this is actually a good idea. As long as you didn't set any transition matrix elements to zero, the chain will have a stationary distribution given by the leading eigenvector of the transition matrix. If you gave each node a score whenever the chain passed through it, that eigenvector would manifest as the "score" accumulated by each node over a long period of time.
This is one of the cool, deep properties of the chain - and it's not "insight-free" in the same way you describe.
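A minimal sketch of that idea (the transition matrix and chain length here are made up for illustration; any strictly positive stochastic matrix works):

```python
import numpy as np

# A small strictly positive transition matrix (rows sum to 1;
# row i gives P(next state | current state i)). Values made up.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.2, 0.6, 0.2],
    [0.3, 0.3, 0.4],
])

# Stationary distribution = leading left eigenvector of P,
# found here by power iteration.
pi = np.ones(3) / 3
for _ in range(1000):
    pi = pi @ P
pi /= pi.sum()

# "Score" each node by how often a simulated chain visits it.
rng = np.random.default_rng(0)
state, counts = 0, np.zeros(3)
for _ in range(50_000):
    counts[state] += 1
    state = rng.choice(3, p=P[state])

print(pi)                      # stationary distribution
print(counts / counts.sum())   # empirical visit frequencies ~ pi
```

The longer the simulation runs, the closer the per-node scores get to that eigenvector.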
One of my favorite examples of this is the plinko game at the Museum of Math in NYC:
It's a physical manifestation of the central limit theorem, realized by a series of balls that can go either right or left in a random walk downwards. You can even adjust the bias on the left/right choice to supply drift to the random walk. The mass of balls that accumulates in the bottom is a histogram representation of the associated limiting Gaussian curve. Very cool.
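The same experiment is easy to simulate; this sketch (peg count, ball count, and bias are arbitrary) prints a rough text histogram of where the balls land:

```python
import numpy as np

# Galton-board sketch: each ball makes n_pegs independent left/right
# choices; p is the probability of stepping right (p != 0.5 adds drift).
rng = np.random.default_rng(42)
n_pegs, n_balls, p = 20, 10_000, 0.5

# A ball's final bin = number of rightward steps: Binomial(n_pegs, p),
# which the central limit theorem says looks Gaussian for large n_pegs.
bins = rng.binomial(n_pegs, p, size=n_balls)

counts = np.bincount(bins, minlength=n_pegs + 1)
for k, c in enumerate(counts):
    print(f"{k:2d} {'#' * (c // 50)}")
```

Nudging `p` away from 0.5 shifts the whole pile sideways, exactly like adjusting the bias on the physical machine.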
I’m fairly new to programming so would be interested in hearing if someone else does this too...
On a walk this afternoon (for some reason), I realized I had a visual shape in my head for most design patterns. Some of them look a lot like a knot-tying diagram, for example. I’m not sure I had realized that before.
I also realized the shape representations are about as close to implementation as I would prefer to store design patterns in my mind. Beyond that, they might be limiting or, at worst, strenuous to use as guidelines. They’re surely more useful (to me) than UML or even code.
But the best part is I can tell they’ve been there all along. I wasn’t trying to make them up. I was just mapping out a project while on a walk. I was pretty relaxed at the time and maybe that’s why this came up.
For what they are, I was a little disappointed by how long it took to parse the Elements of Reusable OO Software book when I was just learning to code. And then, once I had used the patterns and knew them, I found it a bit sad that they were communicated through such rigid forms.
Art actually contends with the same sort of conceptual divisions. There’s the medium which is rigid and cumbersome, the form which is more pure, critical in the abstract but medium-agnostic, and the application, being the interface or the thing out in the world.
So I do agree the ability to transpose these concepts between form and implementation is what matters, but it’s the back and forth that matters, not where one begins.
If the traditional teaching approach for Markov Chains required more intellectual effort, sounds like we've been explaining it poorly, no?
This is really really rare. Education as entertainment means that the goal is to get the consumer to feel good about himself, to feel like he learned something. You're getting paid (in money or kudos or whatever) to engage the consumer's narcissism, not his curiosity.
> Education as entertainment means that the goal is to get the consumer to feel good about himself, to feel like he learned something. You're getting paid (in money or kudos or whatever) to engage the consumer's narcissism, not his curiosity.
This is a profound observation. I’m not sure it only applies in entertainment.
I’ve heard it put other ways, but your version, if applied to education as a whole, might be a pretty big deal. There’s no shortage of evidence, either, from Ivy League campuses to niche interest subcultures.
Maybe “narcissism” isn’t the best word, but some sort of similar tension related to confidence seems to be associated with successful educational environments. We commonly attribute it to a one-way causal relationship, but you’re articulating the opposite. Likewise, we are used to acknowledging that ‘not giving up’, or ‘belief in one’s self’ is a necessity, but again you’re making a separate claim.
You’re saying, in so many words, the confidence in and of itself defines the will as well as the curiosity. And I can’t argue with it. I want to, but I’m not sure I can.
What I've been trying to articulate here and there in the past few weeks is that the object of intellectual development should be acquiring structured ways of not knowing. Rigorous mathematics is such a way: in a basic analysis class you get red marks for thinking you know things you can't prove. The Buddhism of the Heart Sutra is another: the Heart Sutra boils down to "we want to know big truths about emptiness and suffering in order to transcend it and achieve Nirvana, but because of emptiness there's no way to know and nothing to know; therefore, recite these magic words, because that's what wise men do". (It sounds like a scam, but it's really deep in how the argument is actually made at length.)
This of course runs counter to beliefs in "ultratechnology", such that we'll soon craft our own transcendence. But eh, we shall fight in the shade.
These are the questions you'll be running into when you show this to middle and high-school students who have no idea what state space is and what matrix operations are. Like most math visualizations, this one seems best for people who are already somewhat familiar with the material.
I think it's a cool animation, but I'm almost sure I'm suffering from a hindsight bias. My physics professor used to say that nothing is worse than an analogy taken too literally.
If you don't keep your learning organs stimulated they will literally wither and shrink.
we should make things unnecessarily hard to understand if they aren't so
glad yall are still enjoying it.
Had to be said :)
Integrating Explained Visually into digital coursework flows would be amazing.
(Disclaimer: have not taken any online courses recently, so my memories of horrendous archaic information management are wildly outdated I hope.)
So if you have observations A, B, C, B, C, A, D, A, ... then you form pairs: (A,B), (B,C), (C,B), (B,C), (C,A), (A,D), (D,A), ...
Each node in the MC is one of these pairs, and (X,Y) -> (Y,Z) if Z is the new state that appears.
The number of states grows very quickly, but if you create the states dynamically (only as they are observed) then it works fine. This is what you use to randomly generate text using the previous 3 words as a predictor for the next. Then (w0,w1,w2) predicts w3, and your next "state" is (w1,w2,w3).
And so on.
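A toy sketch of this scheme (the token sequence and function names are just for illustration), creating the tuple states lazily as they appear:

```python
import random
from collections import defaultdict

def build_chain(tokens, order=2):
    # Map each tuple of `order` consecutive tokens to the list of
    # tokens that followed it; states are created only as observed.
    chain = defaultdict(list)
    for i in range(len(tokens) - order):
        chain[tuple(tokens[i:i + order])].append(tokens[i + order])
    return chain

def generate(chain, start, n, seed=0):
    rng = random.Random(seed)
    out = list(start)
    for _ in range(n):
        state = tuple(out[-len(start):])
        if state not in chain:   # dead end: no observed successor
            break
        out.append(rng.choice(chain[state]))
    return out

tokens = "A B C B C A D A B C".split()
chain = build_chain(tokens, order=2)
print(generate(chain, ("A", "B"), 10))
```

Bumping `order` to 3 gives exactly the three-word predictor described above, at the cost of many more (lazily created) states.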
After reading through theorems and proofs for dynamical systems theory, and especially considering applications of that theory, I came up with basically two practical take-aways:
1. Complexity matters and the curse of dimensionality is real.
2. All those nice theorems (stability, convergence, splitting, invariance, bifurcations, ...) are stated with respect to a state space and make assumptions about the dynamics on that state space. So if you code up a state space representation that is history-aware with respect to the dynamics you really care about, then:
a) you might lose some theorems that hold for the original process, and
b) you sometimes realize the theorems lose a lot of power when they're telling you something about a tuple of things you care about instead of the actual thing you care about.
If I ever teach people about random dynamical systems, this explanation will come immediately after the definition of a Markov process and I'll continually remind my audience throughout the course. Understanding what goes wrong when a system is "technically" Markovian is probably the best way to obtain a deep intuition. Or at least, it was very helpful to me.
I used a similar scheme to your example, but with letters for generating a word for http://password.supply
> Instead of making predictions from a state that depends on the entire history, an autoregressive model directly predicts y_t using only the k most recent inputs, x_{t-k+1}, ..., x_t. This corresponds to a strong conditional independence assumption. In particular, a feed-forward model assumes the target only depends on the k most recent inputs. Google's WaveNet nicely illustrates this general principle.
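A minimal illustration of that conditional-independence assumption (the coefficients here are made up, not learned): the prediction touches only the k most recent inputs, never the full history.

```python
import numpy as np

# Toy feed-forward autoregressive predictor: y_t depends only on the
# k most recent inputs x_{t-k+1} .. x_t. Weights are fixed, not learned.
k = 3
w = np.array([0.5, 0.3, 0.2])

def predict(xs, t):
    # Reads exactly k values and nothing earlier.
    window = xs[t - k + 1 : t + 1]
    return float(w @ window)

xs = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
print(predict(xs, 4))  # only xs[2:5] is read, not the full history
```

Changing any input older than the window leaves the prediction unchanged, which is the conditional independence assumption made concrete.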
Seems like it would not solve rtkwe's concern. In their example, wouldn't using tuples make a node state equal to (S, COUNT_OF_DAYS) or (R, COUNT_OF_DAYS)? You would still need an infinite number of nodes to represent an unbounded integer state.
So no, you don't need an infinite amount of memory, it's all still finite (although rapidly gets large).
One simple way to add memory is to make the states more complex. For example, instead of recording the weather each day as a state, you could record the weather as two-day tuples. This allows more complex dynamics, but the number of states increases quickly.
Hierarchical Markov models are another way to achieve this. In an HMM, there is a probability matrix for different types of weather (winter vs. summer) and a probability matrix that governs transitions between each of these types.
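Here is a small sketch of the tuple-state construction (the daily weather matrix is made up): lifting a 2-state daily chain to (yesterday, today) pairs squares the number of states.

```python
import numpy as np
from itertools import product

# Base daily chain: states 0 = Sunny, 1 = Rainy; P[i, j] is
# P(tomorrow = j | today = i). Numbers are made up.
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Lift to (yesterday, today) tuple states: 2**2 = 4 states.
# A transition (a, b) -> (b2, c) is only allowed when b == b2
# (the window must shift by one day), and then has probability P[b, c].
states = list(product(range(2), repeat=2))
P2 = np.zeros((4, 4))
for i, (a, b) in enumerate(states):
    for j, (b2, c) in enumerate(states):
        if b == b2:
            P2[i, j] = P[b, c]

print(states)
print(P2)  # still a valid stochastic matrix: each row sums to 1
```

As written, P2 still depends only on today, but each row of P2 can now be tuned independently, which is where the extra memory comes from.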
It really depends on how granular you want to cut it.
You could make 10 rainy states, each with an increasing probability of going to sunny, an equal chance of staying or dropping to a lower-probability rainy state, and no chance of going to a state with a higher probability.
So R90 could be S = 10%, R90 = 45%, and R80 = 45%.
R80 could be S = 20%, R80 = 40%, and R70 = 40%.
That way you don't need to track how many days it's been raining, it will naturally filter down to lower rainy percentages until it inevitably goes back to sunny.
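A quick simulation of this scheme (five levels chosen arbitrarily; the percentages follow the pattern above) shows that rainy spells always end on their own:

```python
import random

# Rain states R90, R80, ...: in state Rp, go to Sunny with probability
# (100 - p)%, otherwise split evenly between staying and dropping to
# the next-lower rain state (the lowest level drops to Sunny).
levels = [90, 80, 70, 60, 50]

def step(state, rng):
    p = int(state[1:])
    r = rng.random() * 100
    if r < 100 - p:
        return "S"                       # rain ends
    if r < 100 - p + p / 2:
        return state                     # stay at this level
    i = levels.index(p) + 1
    return f"R{levels[i]}" if i < len(levels) else "S"

# Average length of a rainy spell that starts at R90.
rng = random.Random(1)
lengths = []
for _ in range(10_000):
    state, days = "R90", 0
    while state != "S":
        days += 1
        state = step(state, rng)
    lengths.append(days)
avg = sum(lengths) / len(lengths)
print(avg)
```

No day counter is needed anywhere; the memory lives entirely in which rainy state the chain currently occupies.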
One method for longer-term dependence is simply expanding the transition matrix and state space, but that’s not unbounded in any meaningful sense.
A recurrent neural network approximates unbounded long-term information for general tasks, but I think what you’re looking for in particular is a push-down automaton, while a Markov chain is a finite-state automaton.
A "stack" here is a stack of (vector, scalar) pairs (a vector and its "weight"). At each time step, the stack has three inputs: the vector to push, the weight to push it with, and the weight to pop off the stack. Its output is a blend of the top 1.0 of weight on the stack. The stack's behavior on a time step is divided into three parts:
1. First, that pop weight is removed from the stack. For example, if you want to remove 0.6 from a stack like [(a, 0.4), (b, 0.4)], you'd remove all of a leaving you with 0.2 to remove from b, so the final stack would be [(b, 0.2)].
2. The pushed vector is placed on the stack with the given push weight.
3. The top 1.0 of the stack is blended together and returned. For example, a stack like [(a, 0.4), (b, 0.4), (c, 0.8)] would take 0.4 of a, 0.4 of b, and 0.2 of c.
This process is differentiable, and you can combine this with the usual recurrent loops to make a recurrent stack machine.
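The three steps above can be sketched like this (plain numpy, no learning; the stack is listed top-first, matching the examples, and a, b, c are placeholder vectors):

```python
import numpy as np

def pop(stack, w_pop):
    # Step 1: remove w_pop of total weight, starting from the top.
    out = []
    for vec, w in stack:
        take = min(w, w_pop)
        w_pop -= take
        if w > take:
            out.append((vec, w - take))
    return out

def push(stack, vec, w_push):
    # Step 2: place the new vector on top with the given weight.
    return [(vec, w_push)] + stack

def read(stack):
    # Step 3: blend together the top 1.0 of weight and return it.
    remaining, total = 1.0, None
    for vec, w in stack:
        take = min(w, remaining)
        remaining -= take
        total = take * vec if total is None else total + take * vec
        if remaining <= 0:
            break
    return total

a = np.array([1.0, 0.0])
b = np.array([0.0, 1.0])
c = np.array([1.0, 1.0])

# Pop example from the text: removing 0.6 from [(a, 0.4), (b, 0.4)].
print(pop([(a, 0.4), (b, 0.4)], 0.6))        # ~[(b, 0.2)]

# Read example: top 1.0 of [(a, 0.4), (b, 0.4), (c, 0.8)].
print(read([(a, 0.4), (b, 0.4), (c, 0.8)]))  # 0.4*a + 0.4*b + 0.2*c
```

Every operation is built from min, +, and *, so the whole thing is differentiable (almost everywhere) with respect to the vectors and weights, which is what lets it sit inside a recurrent loop.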
In any case, unless you're using Markov chains to generate data that needs to be kept secure (like generating a password that should be pronounceable), the predictability of random numbers isn't really relevant.