Graphical Models: Theory

August 24th, 2013

Here at forty.to we are starting to incorporate graphical models into our tool chain. So, to spread the gospel, and hopefully help anyone deciding to get into graphical models, we decided to write a two-part blog post on the subject. This first part covers the basics of graphical model theory. The second part will cover the practice of using graphical models.

As we want to keep the post to a manageable length, we, unfortunately, have to make some assumptions about the reader. We assume the reader is familiar with calculus, probability distributions, conditional probability distributions, and the sum and product rules of probability. One doesn't need a very deep understanding of any of these topics; a cursory understanding should suffice.

So, with these preliminaries out of the way, let's jump in!

Why Graphical Models?

Generally humans find reasoning about and manipulating graphical representations much easier than reasoning about and manipulating equations. So, anytime one can convert manipulations of equations to manipulations of graphical diagrams, it's a win.

One need not look far for examples of just this logic. Feynman diagrams

Feynman Diagram

revolutionized physics, allowing physicists to reason about and manipulate these graphical representations instead of equations. Knot diagrams

Knot Diagram

allowed mathematicians to reason about knots through simple manipulation of  these funny little pictures. More exotically, Kirby Diagrams

Kirby Diagram

allow mathematicians to reason about four dimensional spaces through manipulation of diagrams like that above.

So, it seems only logical to look for a graphical representation of a probability distribution that allows one to easily manipulate and reason about that distribution. Graphical models provide just such a representation.

Bayesian Networks

Consider a probability distribution p(a,b,c) over three random variables a, b, and c. Using the product rule of probability twice, we can rewrite p(a,b,c) as follows 

p(a,b,c) = p(c\mid a, b) p(b\mid a) p(a).

Now we will introduce a graphical form for the right hand side of this equation.

For each random variable we introduce a node and we associate each node with the conditional probability for its variable. Then for each conditional probability, i.e. node, and for each corresponding conditioning variable we add a directed edge from the node corresponding to the conditioning variable to the node corresponding to the conditioned variable.

So, the right hand side of the equation is graphically represented as in Figure 1. (Note Figure 1, and all subsequent figures, were created by Christopher M. Bishop.)

Figure 1

The distribution p(a) is not conditioned; thus, there are no directed edges going in to node a. The distribution p(b\mid a) is conditioned on a; thus, there is a directed edge from node a to node b. The distribution p(c\mid a, b) is conditioned on a and b; thus, there is a directed edge from node a to node c and a directed edge from node b to node c.

So, as the rules for this game are relatively simple, we can explore more complicated diagrams. For example the distribution described by Figure 2

Figure 2

has the form 

p(x_1)p(x_2) p(x_3) p(x_4 \mid x_1, x_2, x_3) p(x_5 \mid x_1, x_3) p(x_6 \mid x_4) p(x_7 \mid x_4, x_5).

One can form an infinite number of such graphs, with nodes representing random variables and directed edges representing conditioning. Generically, such a graph goes under the name of a Bayesian network, and in the following sections we will study Bayesian networks in some detail.
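To make the factorization concrete, here is a minimal numeric sketch of the Figure 2 distribution over binary variables; all of the conditional probability tables below are made-up numbers for illustration, not anything from the post. Because each factor is a normalized conditional distribution, the product automatically sums to 1 over all assignments.

```python
import itertools

# Hypothetical CPTs for the binary variables of Figure 2; each table gives
# the probability that the variable equals 1, given its parents.
p1 = 0.6                                                  # p(x1 = 1)
p2 = 0.3                                                  # p(x2 = 1)
p3 = 0.8                                                  # p(x3 = 1)
p4 = {(a, b, c): 0.1 + 0.2 * a + 0.3 * b + 0.3 * c
      for a in (0, 1) for b in (0, 1) for c in (0, 1)}    # p(x4=1 | x1,x2,x3)
p5 = {(a, c): 0.2 + 0.3 * a + 0.4 * c
      for a in (0, 1) for c in (0, 1)}                    # p(x5=1 | x1,x3)
p6 = {(d,): 0.25 + 0.5 * d for d in (0, 1)}               # p(x6=1 | x4)
p7 = {(d, e): 0.1 + 0.4 * d + 0.4 * e
      for d in (0, 1) for e in (0, 1)}                    # p(x7=1 | x4,x5)

def bern(p, v):
    """Probability that a Bernoulli(p) variable takes value v."""
    return p if v == 1 else 1.0 - p

def joint(x1, x2, x3, x4, x5, x6, x7):
    """The factorization read off from Figure 2."""
    return (bern(p1, x1) * bern(p2, x2) * bern(p3, x3)
            * bern(p4[(x1, x2, x3)], x4)
            * bern(p5[(x1, x3)], x5)
            * bern(p6[(x4,)], x6)
            * bern(p7[(x4, x5)], x7))

# A product of conditional distributions needs no extra normalization.
total = sum(joint(*xs) for xs in itertools.product((0, 1), repeat=7))
```

Summing the joint over all 2^7 assignments gives 1, which is exactly why specifying a Bayesian network factor by factor always yields a valid distribution.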

Conditional Independence

Beyond just looking pretty these graphical models also help one reason about "independence" and "conditional independence".

As you will recall, two random variables a and b are independent if the following equation holds 

p(a,b) = p(a) p(b).

This notion of independence can be generalized to the case of conditional distributions as well. Given the random variable c two random variables a and b are conditionally independent if the following equation holds 

p(a,b \mid c) = p(a \mid c) p(b \mid c).
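This definition can be checked mechanically. Below is a small helper sketch (not from the post) that tests p(a,b \mid c) = p(a \mid c) p(b \mid c) for a joint table over three binary variables; the example tables at the bottom are arbitrary made-up numbers.

```python
import itertools

def conditionally_independent(joint, tol=1e-9):
    """Check p(a,b|c) = p(a|c) p(b|c) for a table over binary (a, b, c).

    `joint` maps (a, b, c) tuples to probabilities.
    """
    for c in (0, 1):
        pc = sum(joint[(a, b, c)] for a in (0, 1) for b in (0, 1))
        for a, b in itertools.product((0, 1), repeat=2):
            p_ab = joint[(a, b, c)] / pc
            p_a = sum(joint[(a, y, c)] for y in (0, 1)) / pc
            p_b = sum(joint[(x, b, c)] for x in (0, 1)) / pc
            if abs(p_ab - p_a * p_b) > tol:
                return False
    return True

def bern(p, v):
    return p if v == 1 else 1.0 - p

# A joint built as p(c) p(a|c) p(b|c) passes by construction.
ci_joint = {(a, b, c):
            bern(0.4, c) * bern([0.2, 0.9][c], a) * bern([0.7, 0.3][c], b)
            for a, b, c in itertools.product((0, 1), repeat=3)}

# A joint that forces a = b fails: knowing a pins down b exactly.
dep_joint = {(a, b, c): 0.25 if a == b else 0.0
             for a, b, c in itertools.product((0, 1), repeat=3)}
```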

Often one is given a joint probability distribution and needs to determine if two, or more, random variables in the joint distribution are independent or conditionally independent. Given a graphical model for the distribution, this question can be answered relatively easily, as we will show below.

Tail to Tail

First consider the probability distribution depicted by the graphical model of Figure 3.

Figure 3

Using what we have discussed previously, it has the form

p(a,b,c) = p(a \mid c) p(b \mid c) p(c).

Let us assume that c is not observed and, in the context of this assumption, examine whether a and b are independent. Using the sum rule of probability we can write the distribution p(a,b) as

p(a,b) = \sum \limits_c p(a,b,c).

Substituting the form of the distribution p(a,b,c) gleaned from the graphical model, we have 

p(a,b) = \sum \limits_c p(a \mid c) p(b \mid c) p(c).

So, generically this distribution does not factor into a product of p(a) and p(b). Hence, a and b are not generically independent.

Consider now the case in which c is observed. In our graphical model we will denote this by coloring the node c in, as in Figure 4.

Figure 4

Again, using what we have learned about graphical models, the joint distribution is 

p(a,b,c) = p(a \mid c) p(b \mid c) p(c).

In addition, the product rule of probability implies

p(a,b,c) = p(a, b \mid c) p(c).

Together these last two equations imply

p(a, b \mid c) = p(a \mid c) p(b \mid c),

where we have used the fact that p(c) \neq 0, as c was observed, to divide through by p(c). This last equation implies that if c is observed, then a and b are conditionally independent.

Generally, the situation depicted in Figure 4 gives rise to the tail-to-tail rule of conditional independence. It states that if an observed node is connected to the tails of two directed edges, e.g. c in Figure 4, then the two nodes at the edges' heads, e.g. a and b in Figure 4, are conditionally independent. This rule simply formalizes what we discovered above.
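We can also verify the tail-to-tail behavior numerically. The sketch below builds p(a,b,c) = p(a \mid c) p(b \mid c) p(c) from arbitrary made-up tables, then checks that a and b are dependent marginally but independent once c is fixed.

```python
# Arbitrary illustrative tables for a tail-to-tail model over binary variables.
pc = 0.4                  # p(c = 1)
pa = {0: 0.2, 1: 0.9}     # p(a = 1 | c)
pb = {0: 0.7, 1: 0.3}     # p(b = 1 | c)

def bern(p, v):
    return p if v == 1 else 1.0 - p

joint = {(a, b, c): bern(pc, c) * bern(pa[c], a) * bern(pb[c], b)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# Marginally (c unobserved): p(a=1, b=1) does not equal p(a=1) p(b=1).
p_ab = sum(joint[(1, 1, c)] for c in (0, 1))
p_a = sum(joint[(1, b, c)] for b in (0, 1) for c in (0, 1))
p_b = sum(joint[(a, 1, c)] for a in (0, 1) for c in (0, 1))
marginally_independent = abs(p_ab - p_a * p_b) < 1e-9

# Conditioned on c: p(a,b|c) = p(a|c) p(b|c) holds for every value of c.
cond_independent = all(
    abs(joint[(a, b, c)] / bern(pc, c) - bern(pa[c], a) * bern(pb[c], b)) < 1e-9
    for a in (0, 1) for b in (0, 1) for c in (0, 1))
```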

Head to Tail

Consider the probability distribution depicted by the graphical model of Figure 5.

Figure 5

Its joint probability distribution can be written as follows

p(a,b,c) = p(a) p(c \mid a) p(b \mid c).

Let us assume that c is not observed and, with this assumption, examine whether a and b are independent. Using the sum rule of probability we can write the distribution p(a,b) as follows

p(a,b) = p(a)\sum \limits_c p(c \mid a) p(b \mid c).

Using the product rule, together with the fact that the graph implies p(b \mid c, a) = p(b \mid c), we have p(c \mid a) p(b \mid c)=p(b,c \mid a). The sum rule then implies that the above equation can be written as 

p(a,b) = p(a) p(b \mid a).

Generically this implies that a and b are not independent if c is not observed.

Consider now the case in which c is observed; this is depicted in Figure 6.

Figure 6

Again, from our study of graphical models, we know the joint probability distribution can be written as follows

p(a,b,c) = p(a) p(c \mid a) p(b \mid c).

The product rule of probability implies 

p(a,b,c) = p(a, b \mid c) p(c).

So, together these imply

p(a,b \mid c) = \frac{p(a) p(c \mid a) p(b \mid c)}{p(c)},

where we have used the fact that p(c) \neq 0 as c has been observed. However, Bayes Rule states that

p(a \mid c) = \frac{p(c \mid a)p(a)}{p(c)}.

So, together that last two equations imply 

p(a,b \mid c) = p(a \mid c) p(b \mid c).

This simply states that if c is observed, then a and b are conditionally independent.

The situation depicted in Figure 6 is that of the head-to-tail rule of conditional independence. If an observed node is connected to the head of one directed edge and the tail of another, e.g. c in Figure 6, then the two nodes at the edges' other ends, e.g. a and b in Figure 6, are conditionally independent.

As a mnemonic for the tail-to-tail rule and the head-to-tail rule, one can think of "influence" as "flowing" from  node a to node b and from node b to node a along the various edges that pass through c. When c is observed, and subsequently its node is colored in, this "influence" is "blocked" and thus a and b are independent as a result.

Head to Head

The tail-to-tail rule and the head-to-tail rule were relatively simple. We will now examine a more complicated rule which is derived from the distribution described in Figure 7.

Figure 7

The joint distribution described by Figure 7 is given by

p(a,b,c) = p(a) p(b) p(c \mid a, b).

We can use the sum rule of probability to determine the distribution p(a,b). It is given by

p(a,b) = p(a) p(b)\sum \limits_c p(c \mid a, b) = p(a) p(b),

where for the last equality we have used the fact that p(c \mid a, b) is a probability distribution and thus the sum over c is 1. Obviously, this implies that a and b are independent. Note that this is exactly the opposite of the tail-to-tail and head-to-tail cases, where, if c was not observed, a and b were not independent.

Consider now the case in which c is observed. This corresponds to Figure 8.

Figure 8

The joint distribution described by Figure 8 is given by

p(a,b,c) = p(a) p(b) p(c \mid a, b).

The product rule of probability implies 

p(a,b,c) = p(a, b \mid c) p(c).

Hence, together these imply

p(a,b \mid c) = \frac{p(a) p(b) p(c \mid a, b)}{p(c)},

where we have used the fact that p(c) \neq 0 which follows from the fact that c was observed. Now, generically this does not imply that a and b are conditionally independent given c.

The situation depicted in Figure 8 is the head-to-head rule of conditional independence. If an observed node is connected to the head of two directed edges, e.g. c in Figure 8, then the two nodes at the edges' other ends, e.g. a and b in Figure 8, are dependent. So, the situation here is exactly reversed from the tail-to-tail and head-to-tail cases.

Actually, the head-to-head rule is more complicated than we have presented it here. It has an addendum. The full head-to-head rule of conditional independence states: If a node is connected to the head of two directed edges, e.g. c in Figure 8, and it or any of its descendants is observed, then the two nodes at the edges' other ends, e.g. a and b in Figure 8, are dependent.
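The head-to-head case, often called "explaining away," is easy to demonstrate numerically. In the sketch below (a hypothetical example, not from the post) a and b are independent fair coins and c is deterministically the OR of a and b; observing c = 1 makes a and b dependent.

```python
# Head-to-head ("explaining away") with two fair coins a, b and c = a OR b.
# The deterministic CPT for c is chosen purely for illustration.
joint = {(a, b, c): 0.25 * (1.0 if c == (a | b) else 0.0)
         for a in (0, 1) for b in (0, 1) for c in (0, 1)}

# Marginally, a and b are independent: p(a,b) = p(a) p(b).
p_ab = sum(joint[(1, 1, c)] for c in (0, 1))
p_a = sum(joint[(1, b, c)] for b in (0, 1) for c in (0, 1))
p_b = sum(joint[(a, 1, c)] for a in (0, 1) for c in (0, 1))
marginally_independent = abs(p_ab - p_a * p_b) < 1e-9

# Conditioned on c = 1, they become dependent: learning that a = 1 already
# "explains" c = 1 and so changes the distribution of b.
pc1 = sum(joint[(a, b, 1)] for a in (0, 1) for b in (0, 1))
p_ab_c1 = joint[(1, 1, 1)] / pc1
p_a_c1 = sum(joint[(1, b, 1)] for b in (0, 1)) / pc1
p_b_c1 = sum(joint[(a, 1, 1)] for a in (0, 1)) / pc1
cond_independent = abs(p_ab_c1 - p_a_c1 * p_b_c1) < 1e-9
```

Here p(a=1, b=1 \mid c=1) = 1/3 while p(a=1 \mid c=1) p(b=1 \mid c=1) = 4/9, so the factorization fails, exactly as the head-to-head rule predicts.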

So, in summary, the tail-to-tail rule states that the two nodes a and b in Figure 3 are dependent unless c is observed, in which case they become independent. The head-to-tail rule states that the two nodes a and b in Figure 5 are dependent unless c is observed, in which case they become independent. Contrastingly, the head-to-head rule says that the two nodes a and b in Figure 7 are independent unless c, or any of its descendants, is observed, in which case a and b become dependent. Generally, if two or more nodes are conditionally independent according to these rules, the nodes are said to be d-separated.

Markov Blanket

In a large graphical model we will often find it useful to determine the minimal set of nodes on which a given node depends. The set of such nodes is rather colorfully known as a Markov blanket. Let us derive the Markov blanket of the node x_i in Figure 9.

Figure 9

To examine which nodes are in the Markov blanket of x_i, we will look at the probability of x_i conditioned on all other  variables and see what nodes appear in that quantity. This set of nodes will be the Markov blanket of x_i.

Through use of the product rule of probability the probability of x_i conditioned on all other  variables can be made to appear as a factor in the joint distribution. Explicitly, the product rule of probability implies that the joint distribution from Figure 9 can be written as

p(x_1, \ldots, x_n) = p(x_i \mid x_1, \ldots, x_{i-1},x_{i+1}, \ldots, x_n) p(x_1, \ldots, x_{i-1},x_{i+1}, \ldots, x_n),

where the nodes in Figure 9 are x_1, \ldots, x_n. The sum rule of probability allows us to write the previous equation as 

p(x_1, \ldots, x_n) = p(x_i \mid x_1, \ldots, x_{i-1},x_{i+1}, \ldots, x_n) \sum \limits_{x_i} p(x_1, \ldots, x_i, \ldots,x_n).

The graphical model of Figure 9, however, allows us to write

p(x_1, \ldots, x_n) = \prod \limits_k p(x_k \mid pa_k),

where pa_k is the set of parent nodes of x_k. Thus, the last two equations imply

p(x_i \mid x_1, \ldots, x_{i-1},x_{i+1}, \ldots, x_n) =\frac{\prod \limits_k p(x_k \mid pa_k)}{\sum \limits_{x_i}\prod \limits_k p(x_k \mid pa_k)}.

 We can actually simplify this seemingly complicated equation, and cancel equivalent factors in the numerator and denominator, by simply observing a few properties of the graph.

First, the summation in the denominator is over x_i. Therefore, any terms that are independent of x_i can be pulled out of this sum and canceled with the corresponding terms in the numerator. Generically, a term p(x_k \mid pa_k) will be independent of x_i if i \neq k and x_i\not\in pa_k. So, rather obviously, the term p(x_i \mid pa_i) cannot be pulled out from the sum, and any term p(x_k \mid pa_k) cannot be pulled out if x_i \in pa_k. All other terms can be pulled out from the sum and canceled with the corresponding term from the numerator.

So, the only terms remaining are terms of the form p(x_i \mid pa_i) and terms of the form p(x_k \mid pa_k) where x_i \in pa_k. Thus, through p(x_i \mid pa_i) the random variable x_i depends upon its parents pa_i. Through p(x_k \mid pa_k) where x_i \in pa_k, the random variable x_i depends upon all its children and all co-parents of its children, i.e. all parents of x_k other than x_i. This set of all nodes upon which x_i depends is its Markov blanket and is colored blue in Figure 9.
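This bookkeeping translates directly into code. The following sketch computes the Markov blanket of a node from a map of parent sets; the DAG encoded at the bottom is the one read off from Figure 2, with string node names introduced for illustration.

```python
def markov_blanket(node, parents):
    """Parents, children, and co-parents of `node` in a DAG.

    `parents` maps each node to the set of its parents.
    """
    blanket = set(parents.get(node, set()))        # the node's parents
    for child, pa in parents.items():
        if node in pa:                             # `child` is a child of node
            blanket.add(child)
            blanket |= pa - {node}                 # co-parents of that child
    return blanket

# The DAG of Figure 2: x4 has parents x1, x2, x3; x5 has parents x1, x3; etc.
parents = {"x1": set(), "x2": set(), "x3": set(),
           "x4": {"x1", "x2", "x3"}, "x5": {"x1", "x3"},
           "x6": {"x4"}, "x7": {"x4", "x5"}}
```

For example, the blanket of x4 is its parents {x1, x2, x3}, its children {x6, x7}, and the co-parent x5 of x7.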

Markov Random Fields

Up until now we have considered only graphs with directed edges. One could also consider undirected edges. However, it's not immediately apparent what a graph with undirected edges would correspond to. (Using undirected edges removes any notion of conditioning.) It turns out that one can actually make sense of undirected graphs, and doing so leads to a set of models called Markov random fields.

Conditional Independence

As one will remember, the head-to-head rule was a bit confusing. It seemed more or less the exact opposite of the tail-to-tail and head-to-tail rules. One might hope that the complexity of this head-to-head rule could be simplified if there were some way of "just making all of the dependency rules look the same." One way of doing this would be to use only undirected edges. Then, as there is no head or tail of an edge, all of the rules (head-to-head, head-to-tail, and tail-to-tail) "look the same." In fact, this intuition is correct, as we will now see.

Consider the undirected graph of Figure 10. Assume we are given three sets of random variables denoted by A, B, and C and asked to determine if A and B are conditionally independent given C. Doing so actually turns out to be easy.

Figure 10

One only needs to consider all paths from nodes in A to nodes in B. If all such paths contain at least one node in C, then A and B are conditionally independent given C. However, if at least one such path does not contain a node in C, then A and B need not be conditionally independent. This is much simpler than the directed case.
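This path-based criterion is just graph reachability with the nodes of C removed, which the following sketch implements; the three-node chain at the bottom is a made-up example.

```python
from collections import deque

def separated(A, B, C, edges):
    """True if every path from A to B in the undirected graph passes through C.

    `edges` is a set of frozensets {u, v}. Implemented as breadth-first
    reachability from A with the nodes in C deleted from the graph.
    """
    adj = {}
    for e in edges:
        u, v = tuple(e)
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen = set(A) - set(C)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        if u in B:
            return False              # found a path from A to B avoiding C
        for v in adj.get(u, ()):
            if v not in seen and v not in C:
                seen.add(v)
                queue.append(v)
    return True

# A small chain a - c - b: conditioning on c separates a from b.
edges = {frozenset({"a", "c"}), frozenset({"c", "b"})}
```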

Markov Blanket

As a result of these conditional independence properties, determination of the Markov Blanket for a node in an undirected graph is relatively simple. For example in Figure 11

Figure 11

the Markov Blanket of the middle node is the set of nodes that share an edge with the middle node.

More generally, the Markov Blanket of  any node x_i in an undirected graph is the set of nodes that share an edge with x_i. Again, this is much simpler than the directed case.

Joint Distribution

Now we understand how conditional independence works for undirected graphs. However, we still haven't examined how to write the joint distribution corresponding to an undirected graph. In this section we will examine how this is done.

Generally, an undirected graph will correspond to some factorization of the joint probability; the question is: Which factorization? To answer this we will use what we already know about undirected graphs, how conditional independence works.

In an undirected graph, consider two nodes x_i and x_j not connected by an edge. Now, for argument's sake, assume that all the other nodes \{x_1, \ldots, x_{i-1},x_{i+1},\ldots, x_{j-1},x_{j+1}, \ldots, x_n\} are observed. Thus, as x_i and x_j are not connected by an edge, all paths from x_i to x_j must contain at least one observed node. Hence, conditioned on all the observed nodes, x_i and x_j are independent. Let us now see what this fact implies about the factorization of the joint distribution.

As we don't know how to factorize the joint distribution of an undirected graph, let's just guess an answer, then see what the guess implies. We will assume that the factorization of the joint distribution contains a factor of the form \psi(x_i,x_j,\ldots). Generically, such a term implies that x_i and x_j are not conditionally independent when all other nodes are observed, as generically such a term does not factor into two terms \psi(x_i,\ldots)\psi(x_j,\ldots). However, conditioned on all the observed nodes, we have just found that x_i and x_j are independent. Thus, such a term \psi(x_i,x_j,\ldots) cannot appear in the factorization of the joint distribution. This leads to a general rule: a factor in the joint distribution can contain only variables that are pairwise connected by edges.

Though this general rule actually suffices for writing the joint distribution, we will find it convenient to introduce a few new concepts which simplify the process. The first is a clique: a clique is a set of nodes such that each pair of nodes in the set is connected by an edge. For example, in Figure 12 the set of nodes \{x_3, x_4\} forms a clique.

Figure 12

One can also define a maximal clique: a maximal clique is a clique to which no additional node may be added while maintaining the clique property. For example, the set of nodes \{x_3, x_4\} is a clique, but not a maximal clique, as one can add node x_2 to the set while maintaining the clique property. However, \{x_2, x_3, x_4\} is a maximal clique, as one cannot add x_1 to the set while maintaining the clique property, because x_1 and x_4 are not connected by an edge.

So, as factors in a joint distribution contain only variables connected by an edge, we can use the partitioning of an undirected graph into maximal cliques to guide us in writing the joint distribution. As each node in a maximal clique is connected by an edge to any other element in the maximal clique, our general rule implies that the associated variables should appear in the same factor in the joint distribution. So, if we define MC as the set of maximal cliques, then the above allows us to write the joint distribution as 

p(x_1,\ldots,x_n) = \frac{1}{Z}\prod\limits_{C \in MC} \psi_C(C),

where the product is over all maximal cliques C and we have introduced Z, known as the partition function, to normalize the joint distribution.
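As a sanity check on this formula, the sketch below builds an unnormalized product of clique potentials for a small four-node graph (the clique structure, matching the \{x_1,x_2,x_3\} and \{x_2,x_3,x_4\} reading of Figure 12, and the potential values are assumptions for illustration), computes the partition function Z by brute force, and confirms the result is a normalized distribution.

```python
import itertools

# Arbitrary positive potentials over the two assumed maximal cliques.
def psi_123(x1, x2, x3):
    return 1.0 + x1 + 2.0 * x2 * x3

def psi_234(x2, x3, x4):
    return 1.0 + x4 + x2 * x3

def unnormalized(x1, x2, x3, x4):
    """Product of the clique potentials, before normalization."""
    return psi_123(x1, x2, x3) * psi_234(x2, x3, x4)

# The partition function Z is the sum of the product over all assignments.
Z = sum(unnormalized(*xs) for xs in itertools.product((0, 1), repeat=4))

def p(x1, x2, x3, x4):
    return unnormalized(x1, x2, x3, x4) / Z

total = sum(p(*xs) for xs in itertools.product((0, 1), repeat=4))
```

Unlike the directed case, the clique potentials need not be probabilities themselves, which is exactly why the explicit 1/Z factor is required.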

Undirected vs Directed

So far we have examined directed graphs and undirected graphs. However, we have not examined the relation between a directed graph and an undirected graph. Here we will examine this relation.

Consider the directed graph in Figure 13a. From what we have seen, the joint distribution described by this directed graph is

p(x_1,\ldots,x_N) = p(x_1)p(x_2 \mid x_1)\cdots p(x_N \mid x_{N-1}).

As this is such a simple graph we can guess that the corresponding undirected graph is obtained by making all directed edges undirected, Figure 13b.

Figure 13a

Figure 13b

The joint distribution described by the undirected graph in Figure 13b is given by

p(x_1,\ldots,x_N) = \frac{1}{Z}\psi_{\{1,2\}}(x_1,x_2)\cdots\psi_{\{N-1,N\}}(x_{N-1},x_N).

Now, from the two previous equations, we can guess the relation between the various factors in the directed joint distribution and the undirected joint distribution. They are

\begin{aligned}\psi_{\{1,2\}}(x_1,x_2) &= p(x_1)p(x_2 \mid x_1)\\ &\;\;\vdots \\ \psi_{\{N-1,N\}}(x_{N-1},x_N) &= p(x_N \mid x_{N-1}).\end{aligned}

Note that with these definitions of \psi_C(C) the partition function Z is identically 1, as the various probabilities involved are already normalized.

Let us move on to more complicated models. Consider the directed graph of Figure 14a. The joint distribution it describes is

p(x_1,x_2,x_3,x_4) =p(x_1)p(x_2)p(x_3)p(x_4 \mid x_1,x_2,x_3).

Here our previous tactic of making all directed edges into undirected edges breaks down. For example, doing so would lead to an undirected graph in which all factors containing x_4 would be

\psi_{\{x_1,x_4\}}(x_1,x_4)\psi_{\{x_2,x_4\}}(x_2,x_4)\psi_{\{x_3,x_4\}}(x_3,x_4).

Obviously this is not correct, as there is no way that the conditional distribution p(x_4 \mid x_1,x_2,x_3) can be captured by such terms. However, the fix is not too hard to come by.

Figure 14a

From the previous example we see that we need to create an undirected graph such that all of the parents of x_4 appear together in a maximal clique, as it is only then that a term containing x_1,  x_2, x_3, and x_4 will appear in the factorization of the undirected graph. To do this one starts with Figure 14a, then adds undirected edges between all parents of x_4; finally one makes all directed edges undirected. The resulting undirected graph is shown in Figure 14b.

Figure 14b

This achieves our goal of putting x_1,  x_2, x_3, and x_4 all in a single maximal clique. The joint distribution described by Figure 14b is

p(x_1,x_2,x_3,x_4) = \frac{1}{Z} \psi_{\{x_1,x_2,x_3,x_4\}}(x_1,x_2,x_3,x_4).

This distribution, in contrast to our original guess, is actually correct. However, as one will note, some of the finer dependency information captured by the directed graph is not captured by the corresponding undirected graph; for example, the fact that the joint distribution has a p(x_1) factor is totally lost. So, this process of going from a directed graph to an undirected graph "destroys" information.

The general procedure for going from a directed graph to an undirected one is only a slight generalization of what we did above. Given a directed graph, one first finds all nodes with more than one parent. For each such node one adds an undirected edge between all pairs of its parents. (This process of "marrying parents" goes by the name of moralization, an unfortunate nomenclature.) Finally, one makes all remaining directed edges undirected. As for the mapping from conditional probabilities of the directed graph to the factors \psi_C(C), there will always be at least one valid mapping as a result of moralization.
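This procedure is short enough to write out directly. The sketch below moralizes a DAG given as a map of parent sets; applied to the Figure 14a structure, where x_4 has parents x_1, x_2, and x_3, it produces the six undirected edges of the complete graph in Figure 14b.

```python
import itertools

def moralize(parents):
    """Moralize a DAG: marry all co-parents, then drop edge directions.

    `parents` maps each node to the set of its parents; edges are returned
    as frozensets, since they are undirected.
    """
    edges = set()
    for child, pa in parents.items():
        for p in pa:
            edges.add(frozenset({p, child}))       # drop the direction
        for u, v in itertools.combinations(pa, 2):
            edges.add(frozenset({u, v}))           # marry each pair of parents
    return edges

# The directed graph of Figure 14a.
parents = {"x1": set(), "x2": set(), "x3": set(), "x4": {"x1", "x2", "x3"}}
moral_edges = moralize(parents)
```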

Inference in Graphical Models

So far we have examined Bayesian networks and Markov random fields. For both we have examined how to go from a graphical model to a joint distribution, and we have examined how conditional independence works in both cases. However, we have not examined how to determine marginal or conditional probabilities from the joint distribution. Here we'll examine how to obtain marginal and conditional probabilities from the joint distribution in an efficient manner for both Bayesian networks and Markov random fields.

Inference on a Chain

We will start with the simplest case, that of determining the marginal distribution of a single random variable x_n in the joint distribution described by the graph in Figure 15. From the sum rule of probability we know

p(x_n)=\sum\limits_{x_1}\cdots\sum\limits_{x_{n-1}}\sum\limits_{x_{n+1}}\cdots\sum\limits_{x_N}p(x_1,\ldots,x_N).

From Figure 15

Figure 15

we know that the joint distribution is given by

p(x_1,\ldots,x_N)=\frac{1}{Z}\prod\limits_{i=1}^{N-1}\psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1}).

The last two equations thus imply

p(x_n)=\frac{1}{Z}\sum\limits_{x_1}\cdots\sum\limits_{x_{n-1}}\sum\limits_{x_{n+1}}\cdots\sum\limits_{x_N}\prod\limits_{i=1}^{N-1}\psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1}).

Generally this looks like a mess. However, a simple observation allows us to unravel this mess.

This simple observation is that multiplication distributes over addition

ab+ac=a(b+c).

So, if we have a sum of various terms such that each term contains a common factor, then we can pull this common factor outside the summation, as in the above simple example. Consider now applying this simple observation to our rather messy equation above.

Consider the summation over x_N. The variable x_N only appears in a single factor \psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N}). Thus, all the other factors in the product of the \psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})'s are common factors relative to this sum; so, we can pull them outside the summation over x_N. Consider next the summation over x_{N-1}. The variable x_{N-1} appears in two terms, the summation mentioned above, through \psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N}), and also the factor \psi_{\{x_{N-2},x_{N-1}\}}(x_{N-2},x_{N-1}). All other factors in the product of the \psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})'s are common factors relative to the sum over x_{N-1}; so, we can pull them outside the summation over x_{N-1}. This story continues with x_{N-2}, x_{N-3}, x_{N-4}... until one reaches x_n which is not summed over. Then the entire process starts again, but with x_1 and from the other side.

Consider the summation over x_1. The variable x_1 only appears in a single factor \psi_{\{x_{1},x_{2}\}}(x_{1},x_{2}). So, all other factors in the product of the \psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})'s are common factors relative to this sum; thus, we can pull them outside the summation over x_1. Considering the summations over x_2,  x_3,  x_4... we see that we can play the same game. So, using all of these facts together, we see that we can write the marginal distribution over p(x_n) as follows 

\begin{aligned}p(x_n)&=\frac{1}{Z} \\ &\Biggl[\sum\limits_{x_{n-1}}\psi_{\{x_{n-1},x_{n}\}}(x_{n-1},x_{n})\cdots\Biggl[\sum\limits_{x_{2}}\psi_{\{x_{2},x_{3}\}}(x_{2},x_{3})\Biggl[\sum\limits_{x_{1}}\psi_{\{x_{1},x_{2}\}}(x_{1},x_{2})\Biggr]\Biggr]\cdots\Biggr] \\ &\Biggl[\sum\limits_{x_{n+1}}\psi_{\{x_{n},x_{n+1}\}}(x_{n},x_{n+1})\cdots\Biggl[\sum\limits_{x_{N-1}}\psi_{\{x_{N-2},x_{N-1}\}}(x_{N-2},x_{N-1})\Biggl[\sum\limits_{x_{N}}\psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N})\Biggr]\Biggr]\cdots\Biggr]\end{aligned}

This is simpler, but still somewhat of a mess. We can clean it up using the following definitions 

\begin{aligned}\mu_\alpha(x_n) &=\Biggl[\sum\limits_{x_{n-1}}\psi_{\{x_{n-1},x_{n}\}}(x_{n-1},x_{n})\cdots\Biggl[\sum\limits_{x_{2}}\psi_{\{x_{2},x_{3}\}}(x_{2},x_{3})\Biggl[\sum\limits_{x_{1}}\psi_{\{x_{1},x_{2}\}}(x_{1},x_{2})\Biggr]\Biggr]\cdots\Biggr] \\\mu_\beta(x_n) &=\Biggl[\sum\limits_{x_{n+1}}\psi_{\{x_{n},x_{n+1}\}}(x_{n},x_{n+1})\cdots\Biggl[\sum\limits_{x_{N-1}}\psi_{\{x_{N-2},x_{N-1}\}}(x_{N-2},x_{N-1})\Biggl[\sum\limits_{x_{N}}\psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N})\Biggr]\Biggr]\cdots\Biggr].\end{aligned}

These allow us to write p(x_n) in the simple form 

p(x_n)=\frac{1}{Z}\mu_\alpha(x_n)\mu_\beta(x_n).

This is much better.

Beyond allowing us to express p(x_n) in a compact form, the factors \mu_\alpha(x_n) and \mu_\beta(x_n) allow us to interpret the calculation of p(x_n) in a very fruitful manner. Staring at the definition of \mu_\alpha(x_n), one sees that it can be written recursively

\mu_\alpha(x_{n})=\sum\limits_{x_{n-1}}\psi_{\{x_{n-1},x_{n}\}}(x_{n-1},x_{n})\mu_\alpha(x_{n-1}).

This suggests that we can interpret \mu_\alpha(x_i) as a message being passed from lower numbered nodes to higher numbered nodes, where at each node the equation above is applied to obtain the subsequent \mu_\alpha(x_i). The initial message in this chain is then given by 

\mu_\alpha(x_2)=\sum\limits_{x_{1}}\psi_{\{x_{1},x_{2}\}}(x_{1},x_{2}).

Similarly, \mu_\beta(x_n) also admits a recursive definition 

\mu_\beta(x_{n})=\sum\limits_{x_{n+1}}\psi_{\{x_{n},x_{n+1}\}}(x_{n},x_{n+1})\mu_\beta(x_{n+1}).

This suggests that we can interpret \mu_\beta(x_i) as a message being passed from higher numbered nodes to lower numbered nodes, where at each node the above equation is applied to obtain the subsequent \mu_\beta(x_i). In this chain the initial message is given by 

\mu_\beta(x_{N})=\sum\limits_{x_{N-1}}\psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N}).

So, in summary, the recursive message passing of \mu_\alpha(x_i) and \mu_\beta(x_i) along this chain can be depicted as in Figure 16.

Figure 16

So we have found that, for the simple case of the undirected graph of Figure 15, we can now marginalize to obtain the distribution p(x_n) for any node x_n in the graph. Next we will see how to extend this result to an arbitrary tree, i.e. a graph in which there is one and only one path between any pair of nodes. However, before we are able to do so, we must take a detour and examine so-called "factor graphs."
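The forward and backward messages can be implemented in a few lines. The sketch below runs the recursions for \mu_\alpha and \mu_\beta on a five-node binary chain with an arbitrary made-up potential and checks the resulting marginal against brute-force summation. It starts each pass from a trivial all-ones message, which is equivalent to the initial messages defined above.

```python
import itertools

N = 5  # number of nodes in the chain of Figure 15

def psi(i, xi, xj):
    """Potential psi_{i,i+1}(x_i, x_{i+1}); the numbers are illustrative."""
    return 1.0 + 0.5 * xi + 0.25 * xj + (i % 2) * xi * xj

def marginal(n):
    """p(x_n) via the forward message mu_alpha and backward message mu_beta."""
    alpha = {x: 1.0 for x in (0, 1)}         # trivial message before node 1
    for i in range(1, n):                    # pass messages 1 -> n
        alpha = {xj: sum(psi(i, xi, xj) * alpha[xi] for xi in (0, 1))
                 for xj in (0, 1)}
    beta = {x: 1.0 for x in (0, 1)}          # trivial message after node N
    for i in range(N - 1, n - 1, -1):        # pass messages N -> n
        beta = {xi: sum(psi(i, xi, xj) * beta[xj] for xj in (0, 1))
                for xi in (0, 1)}
    unnorm = {x: alpha[x] * beta[x] for x in (0, 1)}
    Z = sum(unnorm.values())                 # the partition function
    return {x: unnorm[x] / Z for x in (0, 1)}

def brute_force(n):
    """p(x_n) by summing the full joint over all 2^N assignments."""
    probs = {0: 0.0, 1: 0.0}
    for xs in itertools.product((0, 1), repeat=N):
        w = 1.0
        for i in range(1, N):
            w *= psi(i, xs[i - 1], xs[i])
        probs[xs[n - 1]] += w
    Z = sum(probs.values())
    return {x: probs[x] / Z for x in (0, 1)}
```

Message passing costs O(N) per marginal, while the brute-force sum costs O(2^N); for long chains the distributive-law trick is the whole game.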

Factor Graphs

Both the directed graphs of Bayesian networks and the undirected graphs of Markov random fields are used to express joint probabilities as products of various factors. Generically, both types of graphs yield a joint distribution of the general form

p(x_1,\ldots,x_N)=\prod\limits_{X_s\in S}f_{X_s}(X_s),

where S is a set of subsets of \{x_1,\ldots,x_N\}. A "factor graph" is simply another type of graph which expresses such a factorization.

Explicitly, a factor graph is a graph with a node, depicted using a circle, for every variable in the distribution and with a node, depicted using a square, for every factor f_{X_s}(X_s) in the distribution. It also contains undirected links connecting each factor node to the variable nodes on which the factor depends. So, for example, the factor node corresponding to f_{X_s}(X_s) will have undirected links to all the variable nodes corresponding to variables in the set X_s.

As an example, consider the joint distribution

p(x_1,x_2,x_3)=f_a(x_1,x_2)f_b(x_1,x_2)f_c(x_2,x_3)f_d(x_3).

The factor graph corresponding to this joint distribution is depicted in Figure 17.

Figure 17

Figure 17 has a circular node for each of the variables x_1, x_2, and x_3, and it has a square node for each of the factors f_a,  f_b,  f_c,  and f_d. In addition, it has undirected edges connecting each factor node to the variable nodes on which the factor depends, e.g. the factor node of f_a(x_1,x_2) has undirected edges to the variable nodes for x_1 and x_2.
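A factor graph like Figure 17's can be encoded directly as a set of factor nodes, each recording the variable nodes it links to; the numeric factor tables below are made up purely for illustration.

```python
# One entry per factor node of Figure 17: the variable nodes it links to,
# plus an (illustrative) table for the factor itself.
factor_nodes = {
    "f_a": (("x1", "x2"), lambda x1, x2: 0.5 + 0.5 * x1 * x2),
    "f_b": (("x1", "x2"), lambda x1, x2: 1.0 - 0.25 * x2),
    "f_c": (("x2", "x3"), lambda x2, x3: 0.75 + 0.25 * (x2 ^ x3)),
    "f_d": (("x3",),      lambda x3: 0.4 + 0.2 * x3),
}

def neighbours(var):
    """The factor nodes linked to a variable node by an undirected edge."""
    return {name for name, (vs, _) in factor_nodes.items() if var in vs}

def unnormalized(assignment):
    """p(x1,x2,x3) up to normalization: the product over all factor nodes."""
    w = 1.0
    for vs, f in factor_nodes.values():
        w *= f(*(assignment[v] for v in vs))
    return w
```

Note that f_a and f_b both link to x1 and x2, which is exactly the extra flexibility a factor graph has over an undirected graph: two distinct factors over the same variables stay distinct.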

Given the undirected graph of a Markov random field, we can produce a factor graph that represents the same distribution. One first introduces a variable node for each variable node of the undirected graph; then, for each maximal clique of the undirected graph, one introduces a factor node. Finally, one connects each factor node to the variable nodes in its clique. The factors f_{X_s}(X_s) are then simply set equal to the corresponding clique potentials. For example, using these rules the undirected graph of Figure 18a is converted to the factor graph of Figure 18b. However, one should also note that factor graphs are more "flexible" than the undirected graphs of a Markov random field. For example, the undirected graph of Figure 18a could also correspond to the factor graph of Figure 18c, though not via the rules described above.

Figure 18a

Figure 18b

Figure 18c

Given the directed graph of a Bayesian network, we can also produce a factor graph that represents the same distribution. First, one creates a variable node for each variable node of the directed graph; then one introduces a factor node for each of the directed graph's conditional distributions. Finally, one adds undirected links connecting each factor node to the variable nodes in its conditional distribution. The factors f_{X_s}(X_s) are then simply set equal to the corresponding conditional distributions. For example, using these rules the directed graph of Figure 19a results in the factor graph of Figure 19b. As in the case of undirected graphs, there may be more than one factor graph corresponding to a given directed graph. For example, the directed graph of Figure 19a also corresponds to the factor graph of Figure 19c, though not via the rules above.

Figure 19a

Figure 19b

Figure 19c
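The directed-graph conversion can be sketched in the same style. Again, the representation and the conditional probability tables below are hypothetical, chosen only to illustrate the rule:

```python
import numpy as np

# A toy Bayesian network x1 -> x2 -> x3 over binary variables.  Each
# entry maps a child variable to its parents and its conditional
# probability table; the numbers are made up.
cpds = {
    "x1": ((), np.array([0.6, 0.4])),                     # p(x1)
    "x2": (("x1",), np.array([[0.7, 0.3], [0.2, 0.8]])),  # p(x2 | x1)
    "x3": (("x2",), np.array([[0.9, 0.1], [0.5, 0.5]])),  # p(x3 | x2)
}

# Conversion rule from the text: one factor node per conditional
# distribution, linked to the child variable and all of its parents.
factor_nodes = {}
for child, (parents, table) in cpds.items():
    scope = parents + (child,)  # the variable nodes this factor links to
    factor_nodes[f"f_{child}"] = (scope, table)

for name, (scope, _) in sorted(factor_nodes.items()):
    print(name, "->", scope)
```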

Sum Product Algorithm

Next we will examine how to employ factor graphs to derive marginal distributions from directed or undirected tree-structured graphs. We will derive the so-called "sum-product algorithm," which computes such marginal distributions.

If we are given a directed or undirected tree-structured graph describing a joint distribution p(x_1,\ldots,x_N), then we can, using the rules we established in the previous section, convert the graph into a factor graph that describes the same joint distribution. Using this factor graph, we can then write this joint distribution as a product of various factors, each of which corresponds to a factor node.

Now, one can easily prove that, as we started with a tree-structured graph, our factor graph is also tree-structured. Thus, each factor node f_s attached to a variable node x, say, roots a subtree of our factor graph. So, if we denote the product of all factors in the subtree of f_s as F_s(x,X_s), where X_s denotes all variables other than x in this subtree, then we can write the joint distribution p(x_1,\ldots,x_N) as follows

p(x_1,\ldots,x_N)=\prod\limits_{s\in ne(x)}F_s(x,X_s),

where ne(x) is the set of all factor nodes that are neighbors of x.

Now, if we wish to compute the marginal distribution p(x) from the joint distribution p(x_1,\ldots,x_N), then we can simply employ the sum rule of probability

p(x)=\sum\limits_{\{x_i:x_i\ne x\}}p(x_1,\ldots,x_N).

However, if we use the joint distribution, as derived from our analysis of the factor graph, the marginal distribution can be written as follows 

p(x)=\sum\limits_{\{x_i:x_i\ne x\}}\Bigl[\prod\limits_{s\in ne(x)}F_s(x,X_s)\Bigr].

As the sum is over all x_i such that x_i\ne x, the sum is really over all of the various X_s, as these contain all variables in the factor graph that are not x. So, we can interchange the sum and product, as we did to derive the marginals for Figure 15, to obtain the following form of the marginal distribution 

p(x)=\prod\limits_{s\in ne(x)}\Bigl[\sum\limits_{X_s}F_s(x,X_s)\Bigr],

where the sum is now taken over all variables X_s in the subtree rooted at f_s that are not x. If we define \mu_{f_s\rightarrow x}(x) through

\mu_{f_s\rightarrow x}(x)=\sum\limits_{X_s}F_s(x,X_s),

then we can write the marginal distribution p(x) as follows 

p(x)=\prod\limits_{s\in ne(x)}\mu_{f_s\rightarrow x}(x).

This allows us to view the marginal distribution p(x) as resulting from the product of all messages \mu_{f_s\rightarrow x}(x) that "flow" to x from the factor nodes f_s connected to x by an undirected  edge. This situation is depicted in Figure 20.

Figure 20
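This interchange of sum and product can be checked numerically. The following sketch uses a hypothetical two-factor chain x_1 -- f_a -- x_2 -- f_b -- x_3 with made-up tables, and verifies that the brute-force marginal of x_2 equals the product of the two incoming messages:

```python
import numpy as np

# Chain factor graph x1 -- f_a -- x2 -- f_b -- x3 with unnormalized
# joint p(x1,x2,x3) = f_a(x1,x2) f_b(x2,x3).  Tables are hypothetical.
f_a = np.array([[1.0, 2.0], [3.0, 4.0]])  # indexed [x1, x2]
f_b = np.array([[5.0, 6.0], [7.0, 8.0]])  # indexed [x2, x3]

# Brute-force: build the joint, then sum out x1 and x3.
joint = np.einsum("ij,jk->ijk", f_a, f_b)
marginal_brute = joint.sum(axis=(0, 2))

# Message view: one message per factor node neighboring the root x2,
# mu_{f_a -> x2}(x2) = sum_{x1} f_a(x1, x2), and similarly for f_b.
mu_fa_x2 = f_a.sum(axis=0)
mu_fb_x2 = f_b.sum(axis=1)
marginal_messages = mu_fa_x2 * mu_fb_x2

print(marginal_brute, marginal_messages)  # the two agree
```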

Next let us consider in a bit more detail the form of F_s(x,X_s). As previously mentioned, F_s(x,X_s) is the product of all factors in the subtree rooted by f_s. However, as this subtree is also a tree, we can use its tree structure to decompose F_s(x,X_s) into its factors. In particular, as f_s roots its subtree, f_s is a factor of F_s(x,X_s). Furthermore, as our graph is tree-structured, each variable node x_m attached to the factor node f_s roots a subtree in our factor graph. Thus, F_s(x,X_s) contains a factor G_m(x_m,X_{s_m}), the product of all factors in the subtree rooted at x_m, for each such variable node x_m. (Of course there is no G(x,X_s) factor, as x has already been accounted for.) With this, we can write F_s(x,X_s) as follows

F_s(x,X_s)=f_s(x,x_1,\ldots,x_M)G_1(x_1,X_{s_1})\cdots\,G_M(x_M,X_{s_M}),

where the variable nodes, other than x, attached to f_s are denoted by x_1,\ldots,x_M. This expression for F_s(x,X_s) allows us to write \mu_{f_s\rightarrow x}(x) in a more informative manner

\mu_{f_s\rightarrow x}(x)=\sum\limits_{x_1}\cdots\sum\limits_{x_M}f_s(x,x_1,\ldots,x_M)\prod\limits_{m\in \{ne(f_s)-x\}}\Bigl[\sum\limits_{X_{s_m}}G_m(x_m,X_{s_m})\Bigr],

where ne(f_s) is the set of all variable nodes that are neighbors of f_s and we have, for the same reasons as before, exchanged the sum and product. Now, if we define \mu_{x_m\rightarrow f_s}(x_m) as follows

\mu_{x_m\rightarrow f_s}(x_m)=\sum\limits_{X_{s_m}}G_m(x_m,X_{s_m}),

then we can rewrite \mu_{f_s\rightarrow x}(x) more informatively as 

\mu_{f_s\rightarrow x}(x)=\sum\limits_{x_1}\cdots\sum\limits_{x_M}f_s(x,x_1,\ldots,x_M)\prod\limits_{m\in \{ne(f_s)-x\}}\mu_{x_m\rightarrow f_s}(x_m).

With this we can view \mu_{f_s\rightarrow x}(x) as the product of various messages \mu_{x_m\rightarrow f_s}(x_m), from the variable nodes x_m to f_s, along with the factor f_s itself. Finally, all of this is "marginalized" by summing over x_1,\ldots,x_M. This is depicted in Figure 21.

Figure 21
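As a concrete instance of this rule, the following sketch evaluates \mu_{f_s\rightarrow x}(x) for a hypothetical factor f_s(x,x_1,x_2) over binary variables, with made-up incoming messages; np.einsum performs the weighted sums over x_1 and x_2:

```python
import numpy as np

# A factor f_s(x, x1, x2) over three binary variables, together with
# the messages arriving at f_s from x1 and x2.  All values are
# hypothetical, chosen only to illustrate the computation.
f_s = np.arange(1.0, 9.0).reshape(2, 2, 2)  # indexed [x, x1, x2]
mu_x1_fs = np.array([0.5, 1.5])             # mu_{x1 -> f_s}(x1)
mu_x2_fs = np.array([2.0, 1.0])             # mu_{x2 -> f_s}(x2)

# mu_{f_s -> x}(x)
#   = sum_{x1,x2} f_s(x,x1,x2) mu_{x1->f_s}(x1) mu_{x2->f_s}(x2)
mu_fs_x = np.einsum("abc,b,c->a", f_s, mu_x1_fs, mu_x2_fs)
print(mu_fs_x)  # [17. 41.]
```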

Now let us consider the factor G_m(x_m,X_{s_m}) in a bit more detail. By definition G_m(x_m,X_{s_m}) is the product of all factors in the subtree rooted by x_m. The subtree rooted by x_m consists of "sub-subtrees" each of which is rooted by a factor node f_l which is connected directly to x_m by an edge. This situation is depicted in Figure 22.

Figure 22

From our previous calculations we know that each of these "sub-subtrees" produces a factor F_l(x_m,X_{m_l}). Thus, G_m(x_m,X_{s_m}) can be expressed as follows

G_m(x_m,X_{s_m})=\prod\limits_{l\in\{ne(x_m)-f_s\}}F_l(x_m,X_{m_l}).

This allows us to express \mu_{x_m\rightarrow f_s}(x_m) in terms of F_l(x_m,X_{m_l}) as follows 

\mu_{x_m\rightarrow f_s}(x_m)=\prod\limits_{l\in\{ne(x_m)-f_s\}}\Bigl[\sum\limits_{X_{m_l}}F_l(x_m,X_{m_l})\Bigr],

where we have, for the reasons outlined previously, exchanged the sum and product operations. Using the definition of \mu_{f_l\rightarrow x_m}(x_m), the previous equation can be written as

\mu_{x_m\rightarrow f_s}(x_m)=\prod\limits_{l\in\{ne(x_m)-f_s\}}\mu_{f_l\rightarrow x_m}(x_m).

This can be interpreted as stating that the message the variable node x_m sends to the factor node f_s is the product of all the messages x_m receives from all other adjoining factor nodes.
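In code this rule is just an elementwise product; the message values below are hypothetical:

```python
import numpy as np

# Messages arriving at a variable node x_m from its neighboring factor
# nodes other than f_s (the values are made up for illustration).
incoming = [np.array([1.0, 2.0]),
            np.array([3.0, 0.5]),
            np.array([2.0, 2.0])]

# mu_{x_m -> f_s}(x_m) is the elementwise product of these messages.
mu_xm_fs = np.prod(incoming, axis=0)
print(mu_xm_fs)  # [6. 2.]
```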

With this last result in concert with our previous results, we now know how to pass messages along our factor graph, as we know both how to handle these messages at variable nodes and at factor nodes. The only remaining point to clarify is how the message passing process starts.

As the messages are all eventually passed to the root node, the node whose marginal distribution we want, the obvious place for the messages to start is at the leaves of the tree. The leaves of this tree can be either variable nodes or factor nodes, so we have to handle both cases. If a leaf is a variable node x, then the message it passes to its parent factor node f is simply

\mu_{x\rightarrow f}(x)=1.

This is depicted in Figure 23a.

Figure 23a

For the case of a leaf factor node f, the message it passes to its parent variable node x is simply 

\mu_{f\rightarrow x}(x)=f(x).

This is depicted in Figure 23b.

Figure 23b

So, in summary, to evaluate the marginal distribution p(x) of a variable described by a directed or undirected tree-structured graph, we first use the procedure established previously to convert the graph into a factor graph. Then we view the factor graph as a tree rooted at the variable node x. Starting at the leaves of this tree, we pass leaf messages, either \mu_{x\rightarrow f}(x)=1 or \mu_{f\rightarrow x}(x)=f(x), from the leaves to their parents. We then use the two rules we previously derived to pass messages across factor nodes

\mu_{f_s\rightarrow x}(x)=\sum\limits_{x_1}\cdots\sum\limits_{x_M}f_s(x,x_1,\ldots,x_M)\prod\limits_{m\in \{ne(f_s)-x\}}\mu_{x_m\rightarrow f_s}(x_m)

and across variable nodes 

\mu_{x_m\rightarrow f_s}(x_m)=\prod\limits_{l\in\{ne(x_m)-f_s\}}\mu_{f_l\rightarrow x_m}(x_m)

to propagate the messages from the leaves to the root x. At the root node we then employ 

p(x)=\prod\limits_{s\in ne(x)}\mu_{f_s\rightarrow x}(x)

to determine the final marginal distribution.
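The whole procedure can be sketched as a pair of mutually recursive functions, one per message rule. This is our own minimal implementation for discrete variables, run on the same hypothetical two-factor chain as before; a production implementation would memoize messages rather than recompute them:

```python
import numpy as np
from itertools import product

# Tree-structured factor graph as a list of (scope, table) pairs.  The
# representation and the tables are our own illustrative choices:
# f_a(x1, x2) and f_b(x2, x3) form the chain x1 -- f_a -- x2 -- f_b -- x3.
factors = [
    (("x1", "x2"), np.array([[1.0, 2.0], [3.0, 4.0]])),
    (("x2", "x3"), np.array([[5.0, 6.0], [7.0, 8.0]])),
]
card = {"x1": 2, "x2": 2, "x3": 2}  # cardinality of each variable

def msg_factor_to_var(i, x):
    """mu_{f_i -> x}: sum factor i against messages from its other variables."""
    scope, table = factors[i]
    out = np.zeros(card[x])
    for assign in product(*(range(card[v]) for v in scope)):
        val = table[assign]
        for v, a in zip(scope, assign):
            if v != x:
                val = val * msg_var_to_factor(v, i)[a]
        out[assign[scope.index(x)]] += val
    return out

def msg_var_to_factor(x, i):
    """mu_{x -> f_i}: product of messages from x's other neighboring factors.

    For a leaf variable node the loop body never runs, so the message is
    the vector of ones, matching the leaf rule mu_{x -> f}(x) = 1.
    """
    out = np.ones(card[x])
    for j, (scope, _) in enumerate(factors):
        if j != i and x in scope:
            out = out * msg_factor_to_var(j, x)
    return out

def marginal(x):
    """Unnormalized p(x): product of all messages flowing into x."""
    out = np.ones(card[x])
    for i, (scope, _) in enumerate(factors):
        if x in scope:
            out = out * msg_factor_to_var(i, x)
    return out

p_x2 = marginal("x2")
print(p_x2 / p_x2.sum())  # normalized marginal of x2
```

The mutual recursion terminates precisely because the factor graph is a tree: each recursive call moves one step further from the root, and the leaf cases bottom out.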

Now, if we wanted to compute the marginal distributions for two variable nodes x_i and x_j, then we could simply run the algorithm above twice, once with x_i as root and once with x_j as root. However, there is a much more efficient approach. One can, just as we did above, view x as a root and pass messages from the leaves up to the root x; this suffices to compute the marginal p(x). One can then pass messages in the other direction, from the root back down to the leaves. Once all the leaves have received their messages, each variable node x_k will have obtained messages of the form \mu_{f_s\rightarrow x_k}(x_k) from all neighboring factor nodes. This allows us to calculate the marginal p(x_k) as follows

p(x_k)=\prod\limits_{s\in ne(x_k)}\mu_{f_s\rightarrow x_k}(x_k).

So, with only twice the computation involved in computing p(x), we can compute p(x_k) for any variable x_k in the factor graph!

Finally, we can also use the above construction to compute conditional distributions. Assume the variables of the factor graph are partitioned into observed variables x_i and a latent variable z, and we want to find the distribution of z conditioned on the x_i, in other words p(z\mid \hat{x}_1,\ldots,\hat{x}_M), where the observed value of x_i is \hat{x}_i. To do so, we first examine all the factors in the factor graph. If a factor f_s(\ldots,x_i,\ldots) is a function of x_i, then we simply multiply the factor f_s(\ldots,x_i,\ldots) by I(x_i,\hat{x}_i), which is 1 if x_i=\hat{x}_i and 0 otherwise. After updating the factors for all the observed variables x_i, one simply runs the above algorithm. Due to the introduction of the I(x_i,\hat{x}_i) factors, it now computes p(z\mid \hat{x}_1,\ldots,\hat{x}_M), up to normalization, instead of p(z).
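The indicator trick can be illustrated on a hypothetical two-factor chain x_1 -- f_a -- x_2 -- f_b -- x_3 with made-up tables: clamping x_3 to an observed value by zeroing the disallowed entries of every factor containing x_3, then running the same message computations, reproduces the conditional obtained directly from the definition:

```python
import numpy as np

# Chain x1 -- f_a -- x2 -- f_b -- x3 (tables hypothetical).  We observe
# x3 = 1 and want p(x2 | x3 = 1).
f_a = np.array([[1.0, 2.0], [3.0, 4.0]])  # indexed [x1, x2]
f_b = np.array([[5.0, 6.0], [7.0, 8.0]])  # indexed [x2, x3]

# Multiply every factor containing x3 by I(x3, 1), i.e. zero out the
# entries with x3 != 1.
indicator = np.array([0.0, 1.0])
f_b_clamped = f_b * indicator[np.newaxis, :]

# Run the usual message computation on the clamped factors.
mu_fa_x2 = f_a.sum(axis=0)
mu_fb_x2 = f_b_clamped.sum(axis=1)
posterior = mu_fa_x2 * mu_fb_x2
posterior = posterior / posterior.sum()  # normalize at the end

# Brute-force check against the definition of conditioning.
joint = np.einsum("ij,jk->ijk", f_a, f_b)
check = joint[:, :, 1].sum(axis=0)
check = check / check.sum()
print(posterior, check)  # the two agree
```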
