Markov Models
The Markov Assumption
A Markov model is a model in which the probability distribution of future states depends only on the current states. This is known as the Markov assumption.
For example, we could denote the weather on day $t$ as $X_t$. The Markov assumption is then that the weather today only depends on the weather yesterday  that is, the days before yesteday don't convey any additional information. Formally, we can state this $$\forall t \; p(X_{t+1}X_t=x_t) = p(X_{t+1}=x_t  X_t=x_t,\dots,X_0=x_0)$$
where the lowercase $p$ yields the distribution of a variable, rather than the raw probability given by the uppercase $P$.
At first this Markov assumption may seem very restrictive, but it is actually not too bad. Take the weather for instance. Suppose, we wanted to take into account the last three days, rather than just yesterday. Then, we can just let $X_t$ represent three days of weather, rather than one  thereby satisfying the Markov assumption.
Similarly, we can easily build Markov models over multiple variables. For instance, if we want a model over variables $A$, $B$, and $C$, then we can define a new variable $X$ as a tuple of $A,B,C$ values.
Transition Matrices
Suppose that the domain of our random variable $X$ is the set of integers between $0$ and $n$. Then, given that we're in state $i$, the probability of transition to state $j$ is simply some constant number, which we can denote $a_{ij}$. As the notation might indicate, we can use this to define a $n \times n$ matrix, $A$, which will soon prove to be useful.
Suppose at the beginning, you had an $n$component vector $\vec{v}$ for which $\vec{v}_i$ represents your prior of beginning in state $i$. It turns out that $A \vec{v}$ must likewise yield a vector representing the probability you assign to being in a state at time $1$. Likewise, $A^t \vec{v}$ yields the probabilities you assign to be in different states at time $t$.
Example
Suppose we were modeling the weather with two states: sunny and rainy. We can draw such a model as
The model above, for instance, indicates if yesterday was sunny, then today has a 70% chance of being sunny and a 30% chance of being rainy. If yesterday was rainy, on the other hand, then today has a 80% chance of being sunny and a 20% chance of being rainy.
If we call "sunny" 0 and "rainy" 1, then we can represent this as a transition matrix: $$A = \left[\begin{matrix} 0.7 & 0.8 \\ 0.3 & 0.2 \end{matrix}\right]$$ or, more generally, $A=[a_{ij}]$ where $$a_{ij} = P(X_{t+1} = i  X_t = j)$$
Suppose the first day was sunny: $$\vec{v}_0 = \left[\begin{matrix} 1 \\ 0 \end{matrix}\right]$$
Then we can compute the probability distribution for the second day: $$A \vec{v} = \left[\begin{matrix} 0.7 & 0.8 \\ 0.3 & 0.2 \end{matrix}\right] \left[\begin{matrix} 1 \\ 0 \end{matrix}\right] = \left[\begin{matrix} 0.7 \\ 0.3 \end{matrix}\right]$$
This should make sense if you think about the model. Likewise, we can compute the probability distribution of $2$ days: $$A^2 \vec{v} = \left[\begin{matrix} 0.73 & 0.72 \\ 0.27 & 0.28 \end{matrix}\right] \left[\begin{matrix} 1 \\ 0 \end{matrix}\right] = \left[\begin{matrix} 0.73 \\ 0.27 \end{matrix}\right]$$
Taking the Limit
In the example above, we can consider what happens when $t$ goes to infinity by evaluating $$\lim_{t \rightarrow \infty}{A^t \vec{v}}$$
It turns out that in lots of models we care about, this limit does exist. For instance, in the example above, we get $$\lim_{t \rightarrow \infty}{A^t \vec{v}} = \left(\lim_{t \rightarrow \infty}{A^t}\right) \vec{v} = \left[\begin{matrix} 8/11 & 8/11 \\ 3/11 & 3/11 \end{matrix}\right] \vec{v}$$
Note, that this means that regardless of wheather the first day is sunny or rainy, the probabilities eventually converge to an $8/11$ chance of sunny: $$\left[\begin{matrix} 8/11 & 8/11 \\ 3/11 & 3/11 \end{matrix}\right] \left[\begin{matrix} 1 \\ 0 \end{matrix}\right] = \left[\begin{matrix} 8/11 \\ 3/11 \end{matrix}\right]$$ $$\left[\begin{matrix} 8/11 & 8/11 \\ 3/11 & 3/11 \end{matrix}\right] \left[\begin{matrix} 0 \\ 1 \end{matrix}\right] = \left[\begin{matrix} 8/11 \\ 3/11 \end{matrix}\right]$$
Another way to interpret this is that, in the long run, 8 of 11 days will be sunny. More generally, the eigenvector of the transition matrix can be interpreted as the expected proportion of time we'll spend in each state. We call this expected proportion a stationary distribution
Note, just as a single matrix can have multiple eigenvectors, a single model can have multiple stationary distributions. You could imagine, for instance, a very simple Markov model for a person in a video game:
In this model, if you fall into either pit, you'll stay there forever. If you start in World, then the expected proportion of times you'll stay in Pit 1 is 50%  likewise for Pit 2. However, if you start in Pit 1, then you expect to be there for literally every state  a proportion of 100%.
Hence, there are actually an infinite number limiting distributions (eigenvectors), all of the form: $[0,x,1x]$. For many models, however, there is exactly one stationary distribution; in other words, for many models, no matter where you start, you'll eventually reach the same stationary distribution  in this case, we call it a limiting distribution.
Hidden Markov Models
You could imagine the true state ($X_t$) is hidden, but that we have access to a piece of evidence, $E_t$, each timestep. This is the general idea of hidden Markov models. In particular, we assume that the evidence variable's distribution depends only on the corresponding $X_t$.
For instance, imagine you grew up on a farm. Since then, you've moved away, but you come back every 4th of July. You can remember, on a scale from 110, how much rain fell the year you left, but you can only observe the height of the corn on the 4th of July.
In this example, imagine $X_t$ is the raininess of year $t$, and $E_t$ is the height of corn that year: $X_t$ is the hidden variable that determines the evidence $E_t$.
Theory
In particular, as with the normal Markov models, we will still have
 a prior for the starting distribution, represented as a vector
 transition probabilities, represented as a matrix
Before continuing, we should note something that will make our lives much simpler: when applying Bayes' theorem, we can usually ignore the denominator. The denominator is simply a way to ensure the probabilities add up to 1, but if we relax this assumption, then all probabilities will remain proportional to one another. So, we adopt have the theorem $$P(T=tE=e) \propto P(E=eT=t) P(T=t)$$
Now, we can talk about how to algorithmically compute the distribution for $X_t$ given all evidence up until then. However, before giving the actual pseudocode, let's go through the math to establish why the algorithm works.
You start out with a prior distribution for $X_1$. The first step is just to update on the evidence $E_1$ using Bayes' theorem: $$P(X_1=x_1  E_1=e_1) \propto P(E_1=e_1  X_1=x_1) P(X_1=x_1)$$
Note, that the prior distribution for $X_1$, the emission probabilities, and the transition probabilities are all given by and constant throughout the problem, so we've really just used all these known variables to compute the distribution of $X_1$ given $E_1=e_1$.
Imagine, now, that we wanted to compute $P(X_2=x_2  E_1=e_1,E_2=e_2)$. By Bayes' theorem, we can compute this, but let's ignore the denominator, because we can simply normalize it later: $$P(X_2=x_2  E_1=e_1,E_2=e_2) \propto P(E_1=e_1,E_2=e_2  X_2=x_2) P(X_2=x_2)$$
...[todo]
Forward Algorithm
Using this information, we can update in a Bayesian and "dynamic" way as we move.
While we'll present the pseudocode soon, here's the jist:
 Compute the distribution of $X_1$ by using the prior and the evidence $E_1$: $$P(X_1=x_1  E_1=e_1) \propto P(E_1=e_1  X_1=x_1) * P(X_1=x_1)$$

For each $i \in [2, t]$, dynamically update the distribution for $X_i$. In particular, take our distribution for $X_{i1}$ and
 multiply this distribution (vector) by our transition matrix, to compute the distribution for $X_i$.
 multiply this distribution (vector) componentwise the probability of observing $E_i$ given that state
foo(prior, transition, emit, evidence) { // compute the distribution of X_1 dist = {} for (state in prior) { dist[state] = prior[state] * emit[evidence[0], state] } dist = normalize(dist); // dynamically compute later distributions for (t = 1; t < evidence.length; ++t) { new_dist = {} for (old_state in prior) { new_dist[state] = 0; for (new_state in prior) { new_dist[state] += dist[old_state] * transition[new_state, old_state]; } new_dist[new_state] *= emit[evidence[t], new_state]; } dist = normalize(new_dist); } return dist; } normalize(dist) { float sum = 0; for (state in dist) sum += dist[state]; for (state in dist) dist[state] /= sum; return dist; }
To understand this algorithm, you need to realize that dist
represents the distribution of variable $X_t$ given evidence $e_1,\dots,e_t$.
[todo]
Particle Filtering
As implemented above, this forward algorithm runs in $O(S^2 \cdot T)$ time, where $S$ is the number of states and $T$ is the number of timesteps. However, if each state only has a small number of neighboring states, than a bit of cleverness can reduce this to $O(S \cdot T)$.
To improve this running time, we have to drop Bayesianoptimal belief updating. Basically, we choose a subset of states called particles and
 update the probability each particle is correct when we are given evidence
 change each particle's state based on transition probabilities
While the psudocode is a little lengthy, here's the idea:
 Create $n$ particles. Each particle is just a statefloat pair. At the beginning  we randomly select states according to our prior to create each particle, and we set its value to 1.
 For each timestep
 Change each particle's state randomly  according to the transition probabilities
 Update each particle's value by multiplying it by the likelihood it is true given the new evidence.
 Sample $n$ particles from this list with replacement. However, make this a weighted random choosing, so that particles with higher values are more likely to be chosen. This new set of particles is used in the next iteration, and each particle has value 1 again.
Viterbi Algorithm
So, far we have focused on finding the most likely state you are currently in. However, what if you want to find the most likely sequence of states you were in over time? This might be useful for, say, speechtotext models. More formally, we're computing $$argmax_{x_1,\dots,x_t}{P(X_1=x_1,\dots,X_t=x_t  E_1=e_1,\dots,E_t=e_t)}$$
It turns out we can do this in a dynamically
function viterbi(priors, transition, emissions, evidence) { /* probs[s] will be the probability of the most likely path that ends in state s */ Mapprobs[evidence.length]; Map froms[evidence.length]; // initalize probs to priors and mlu for (s in states) { probs[0][s] = prior[s] * emission[s, evidence[0]]; froms[0][s] = null; } // dynamicaly update probs and froms for (var i = 1; i < evidence.length; ++i) { for (state_to in states) { max_p = 0; max_from = null; for (s in states) { p = probs[i1][s] * emission[s, evidence[1]] * transition[s, state_to]; if (p > max_p) { max_p = p; max_from = s; } } probs[i][state_to] = max_p; froms[i][max_from] = max_from; } } // find path's endpoint max_p = 100; endpoint = null; for (s in states) { if (probs.end()[s] > max_p) { max_p = probs.end()[s]; endpoint = s; } } path = [endpoint]; for (var i = froms.length  1; i > 0; i) path.append(froms[i][path.first()]); path.reverse(); return path; }