Here at forty.to we are starting to incorporate graphical models into our tool chain. So, to spread the gospel, and hopefully help anyone deciding to get into graphical models, we decided to write a two-part blog post on the subject. This first part covers the basics of graphical model theory. The second part will cover the practice of using graphical models.

As we want to keep the post to a manageable length we, unfortunately, have to make some assumptions about the reader. We assume the reader is familiar with calculus, probability distributions, conditional probability distributions, and the sum and product rules of probability. One doesn't have to have a very deep understanding of any of these topics; a cursory understanding should suffice.

So, with these preliminaries out of the way, let's jump in!

## Why Graphical Models?

Generally humans find reasoning about and manipulating graphical representations much easier than reasoning about and manipulating equations. So, anytime one can convert manipulations of equations to manipulations of graphical diagrams, it's a win.

One need not look far for examples of just this logic. Feynman diagrams revolutionized physics, allowing physicists to reason about and manipulate graphical representations instead of equations. Knot diagrams allowed mathematicians to reason about knots through simple manipulation of funny little pictures. More exotically, Kirby diagrams allow mathematicians to reason about four-dimensional spaces through manipulation of diagrams.

So, it seems only logical to look for a graphical representation of a probability distribution that allows one to easily manipulate and reason about that distribution. Graphical models provide just such a representation.

## Bayesian Networks

Consider a probability distribution $p(a, b, c)$ over three random variables $a$, $b$, and $c$. Using the product rule of probability twice, we can rewrite $p(a, b, c)$ as follows

$$p(a, b, c) = p(c|a, b)\,p(b|a)\,p(a)$$

Now we will introduce a graphical form for the right-hand side of this equation.

For each random variable we introduce a node and we associate each node with the conditional probability for its variable. Then for each conditional probability, i.e. node, and for each corresponding conditioning variable we add a directed edge from the node corresponding to the conditioning variable to the node corresponding to the conditioned variable.

So, the right hand side of the equation is graphically represented as in Figure 1. (Note Figure 1, and all subsequent figures, were created by Christopher M. Bishop.)

The distribution $p(a)$ is not conditioned; thus, there are no directed edges going into node $a$. The distribution $p(b|a)$ is conditioned on $a$; thus, there is a directed edge from node $a$ to node $b$. The distribution $p(c|a, b)$ is conditioned on $a$ and $b$; thus, there is a directed edge from node $a$ to node $c$ and a directed edge from node $b$ to node $c$.

So, as the rules for this game are relatively simple, we can explore more complicated diagrams. For example, the distribution described by Figure 2 has the form

$$p(x_1, \ldots, x_7) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1, x_2, x_3)\,p(x_5|x_1, x_3)\,p(x_6|x_4)\,p(x_7|x_4, x_5)$$

One can form an infinite number of such graphs, with nodes representing random variables and directed edges representing conditioning. Generically, such a graph goes under the name of a *Bayesian network*, and in the following sections we will study Bayesian networks in some detail.
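To make the factorization concrete, here is a minimal Python sketch of the Figure 1 factorization $p(a, b, c) = p(c|a, b)\,p(b|a)\,p(a)$ for binary variables. The numerical tables are hypothetical, chosen purely for illustration.

```python
# A minimal sketch of the factorization p(a, b, c) = p(c|a,b) p(b|a) p(a)
# for binary variables. All numbers below are hypothetical.

p_a = {0: 0.6, 1: 0.4}                       # p(a)
p_b_given_a = {0: {0: 0.7, 1: 0.3},          # p(b|a): p_b_given_a[a][b]
               1: {0: 0.2, 1: 0.8}}
p_c_given_ab = {(0, 0): {0: 0.9, 1: 0.1},    # p(c|a,b): p_c_given_ab[(a,b)][c]
                (0, 1): {0: 0.5, 1: 0.5},
                (1, 0): {0: 0.4, 1: 0.6},
                (1, 1): {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    """Evaluate p(a, b, c) via the Bayesian-network factorization."""
    return p_a[a] * p_b_given_a[a][b] * p_c_given_ab[(a, b)][c]

# Because each conditional table is normalized, the joint sums to 1.
total = sum(joint(a, b, c) for a in (0, 1) for b in (0, 1) for c in (0, 1))
```

Any joint distribution over discrete variables can be encoded this way; the graph just records which conditioning variables each factor keeps.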

### Conditional Independence

Beyond just looking pretty, these graphical models also help one reason about independence and conditional independence.

As you will recall, two random variables $a$ and $b$ are *independent* if the following equation holds

$$p(a, b) = p(a)\,p(b)$$

This notion of independence can be generalized to conditional distributions as well. Given a random variable $c$, two random variables $a$ and $b$ are *conditionally independent* given $c$ if the following equation holds

$$p(a, b|c) = p(a|c)\,p(b|c)$$

Often one is given a joint probability distribution and one needs to determine if two, or more, random variables in the joint distribution are independent or conditionally independent. Given a graphical model for the distribution, this question can be answered relatively easily, as we will show below.

#### Tail to Tail

First consider the probability distribution depicted by the graphical model of Figure 3.

Using what we have discussed previously, it has the form

$$p(a, b, c) = p(a|c)\,p(b|c)\,p(c)$$

Let us assume that $c$ is not observed and, in the context of this assumption, examine whether $a$ and $b$ are independent. Using the sum rule of probability we can write the distribution $p(a, b)$ as

$$p(a, b) = \sum_c p(a, b, c)$$

Substituting the form of the distribution gleaned from the graphical model, we have

$$p(a, b) = \sum_c p(a|c)\,p(b|c)\,p(c)$$

So, generically this distribution does not factor into a product of $p(a)$ and $p(b)$. Hence, $a$ and $b$ are not generically independent.

Consider now the case in which $c$ is observed. In our graphical model we will denote this by coloring the node $c$ in, as in Figure 4.

Again, using what we have learned about graphical models, the joint distribution is

$$p(a, b, c) = p(a|c)\,p(b|c)\,p(c)$$

In addition, the product rule of probability implies

$$p(a, b, c) = p(a, b|c)\,p(c)$$

Together these last two equations imply

$$p(a, b|c) = p(a|c)\,p(b|c)$$

where we have used the fact that $p(c) \neq 0$, as $c$ was observed, to divide through by $p(c)$. This last equation implies that if $c$ is observed, then $a$ and $b$ are conditionally independent.

Generally, the situation depicted in Figure 4 gives rise to the *tail-to-tail rule* of conditional independence. It states that if an observed node is connected to the tails of two directed edges, e.g. $c$ in Figure 4, then the two nodes at the directed edges' heads, e.g. $a$ and $b$ in Figure 4, are independent. This rule simply formalizes what we discovered above.
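To see the tail-to-tail rule in action numerically, here is a small Python check. The tables $p(c)$, $p(a|c)$, and $p(b|c)$ below are hypothetical; the check shows that $a$ and $b$ are dependent when $c$ is marginalized out, yet factorize once $c$ is conditioned on.

```python
# Numerical check of the tail-to-tail rule for binary a, b, c.
# All distributions below are hypothetical.

p_c = {0: 0.5, 1: 0.5}
p_a_c = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p(a|c): p_a_c[c][a]
p_b_c = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p(b|c): p_b_c[c][b]

def joint(a, b, c):
    return p_a_c[c][a] * p_b_c[c][b] * p_c[c]

def p_ab(a, b):                       # sum rule: marginalize c
    return sum(joint(a, b, c) for c in (0, 1))

def p_a(a): return sum(p_ab(a, b) for b in (0, 1))
def p_b(b): return sum(p_ab(a, b) for a in (0, 1))

# With c unobserved, p(a,b) != p(a) p(b): a and b are dependent ...
dependent = abs(p_ab(0, 0) - p_a(0) * p_b(0)) > 1e-6

# ... but conditioned on c the joint factorizes: p(a,b|c) = p(a|c) p(b|c).
cond_factorizes = all(
    abs(joint(a, b, c) / p_c[c] - p_a_c[c][a] * p_b_c[c][b]) < 1e-9
    for a in (0, 1) for b in (0, 1) for c in (0, 1))
```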

#### Head to Tail

Consider the probability distribution depicted by the graphical model of Figure 5.

Its joint probability distribution can be written as follows

$$p(a, b, c) = p(a)\,p(c|a)\,p(b|c)$$

Let us assume that $c$ is not observed and, with this assumption, examine whether $a$ and $b$ are independent. Using the sum rule of probability we can write the distribution $p(a, b)$ as follows

$$p(a, b) = p(a) \sum_c p(c|a)\,p(b|c)$$

Using the product rule we have $p(c|a)\,p(b|c) = p(b, c|a)$. The sum rule then implies that the above equation can be written as

$$p(a, b) = p(a)\,p(b|a)$$

Generically this implies that $a$ and $b$ are not independent if $c$ is not observed.

Consider now the case in which $c$ is observed; this is depicted in Figure 6.

Again, from our study of graphical models, we know the joint probability distribution can be written as follows

$$p(a, b, c) = p(a)\,p(c|a)\,p(b|c)$$

The product rule of probability implies

$$p(a, b, c) = p(a, b|c)\,p(c)$$

So, together these imply

$$p(a, b|c) = \frac{p(a)\,p(c|a)\,p(b|c)}{p(c)}$$

where we have used the fact that $p(c) \neq 0$ as $c$ has been observed. However, Bayes' rule states that

$$p(a|c) = \frac{p(c|a)\,p(a)}{p(c)}$$

So, together the last two equations imply

$$p(a, b|c) = p(a|c)\,p(b|c)$$

This simply states that if $c$ is observed, then $a$ and $b$ are conditionally independent.

The situation depicted in Figure 6 is that of the *head-to-tail rule* of conditional independence. If an observed node is connected to the head of one directed edge and the tail of another, e.g. $c$ in Figure 6, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 6, are independent.

As a mnemonic for the tail-to-tail rule and the head-to-tail rule, one can think of "influence" as "flowing" from node $a$ to node $b$ and from node $b$ to node $a$ along the various edges that pass through $c$. When $c$ is observed, and subsequently its node is colored in, this "influence" is "blocked", and thus $a$ and $b$ are independent as a result.
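The head-to-tail derivation above can be checked numerically the same way as the tail-to-tail one. The tables below are hypothetical; the check confirms that $a$ and $b$ are dependent when $c$ is hidden and conditionally independent once $c$ is observed.

```python
# Numerical check of the head-to-tail rule: a -> c -> b, binary variables.
# All distributions below are hypothetical.

p_a = {0: 0.5, 1: 0.5}
p_c_a = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # p(c|a): p_c_a[a][c]
p_b_c = {0: {0: 0.8, 1: 0.2}, 1: {0: 0.3, 1: 0.7}}  # p(b|c): p_b_c[c][b]

def joint(a, b, c):
    return p_a[a] * p_c_a[a][c] * p_b_c[c][b]

def p_ab(a, b):
    return sum(joint(a, b, c) for c in (0, 1))

def p_c(c):
    return sum(joint(a, b, c) for a in (0, 1) for b in (0, 1))

marg_a = lambda a: sum(p_ab(a, b) for b in (0, 1))
marg_b = lambda b: sum(p_ab(a, b) for a in (0, 1))

# c unobserved: a and b are dependent.
dependent = abs(p_ab(0, 0) - marg_a(0) * marg_b(0)) > 1e-6

# c observed: p(a,b|c) = p(a|c) p(b|c).
def p_a_given_c(a, c):
    return sum(joint(a, b, c) for b in (0, 1)) / p_c(c)
def p_b_given_c(b, c):
    return sum(joint(a, b, c) for a in (0, 1)) / p_c(c)
cond_indep = all(
    abs(joint(a, b, c) / p_c(c) - p_a_given_c(a, c) * p_b_given_c(b, c)) < 1e-9
    for a in (0, 1) for b in (0, 1) for c in (0, 1))
```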

#### Head to Head

The tail-to-tail rule and the head-to-tail rule were relatively simple. We will now examine a more complicated rule which is derived from the distribution described in Figure 7.

The joint distribution described by Figure 7 is given by

$$p(a, b, c) = p(a)\,p(b)\,p(c|a, b)$$

We can use the sum rule of probability to determine the distribution $p(a, b)$. It is given by

$$p(a, b) = \sum_c p(a)\,p(b)\,p(c|a, b) = p(a)\,p(b)$$

where for the last equality we have used the fact that $p(c|a, b)$ is a probability distribution and thus the sum over $c$ is $1$. Obviously, this implies that $a$ and $b$ are independent. Note, this is exactly opposite to the tail-to-tail and head-to-tail cases where, if $c$ was not observed, $a$ and $b$ were not independent.

Consider now the case in which $c$ is observed. This corresponds to Figure 8.

The joint distribution described by Figure 8 is given by

$$p(a, b, c) = p(a)\,p(b)\,p(c|a, b)$$

The product rule of probability implies

$$p(a, b, c) = p(a, b|c)\,p(c)$$

Hence, together these imply

$$p(a, b|c) = \frac{p(a)\,p(b)\,p(c|a, b)}{p(c)}$$

where we have used the fact that $p(c) \neq 0$, which follows from the fact that $c$ was observed. Now, generically this does not imply that $a$ and $b$ are conditionally independent given $c$.

The situation depicted in Figure 8 is the *head-to-head rule* of conditional independence. If an observed node is connected to the heads of two directed edges, e.g. $c$ in Figure 8, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 8, are dependent. So, the situation here is exactly reversed from the tail-to-tail and head-to-tail cases.

Actually the head-to-head rule is more complicated than we have presented it here. It has an addendum. The full *head-to-head rule* of conditional independence states: if a node is connected to the heads of two directed edges, e.g. $c$ in Figure 8, and it *or any of its descendants* are observed, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 8, are dependent.

So, in summary, the tail-to-tail rule states that the two nodes $a$ and $b$ in Figure 3 are dependent unless $c$ is observed, in which case they become independent. The head-to-tail rule states that the two nodes $a$ and $b$ in Figure 5 are dependent unless $c$ is observed, in which case they become independent. Contrastingly, the head-to-head rule says that the two nodes $a$ and $b$ in Figure 7 are independent unless $c$, or any of its descendants, is observed, in which case $a$ and $b$ become dependent. Generally, if two, or more, nodes are conditionally independent according to these rules, the nodes are said to be *d-separated*.
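The head-to-head behavior, often called "explaining away", is striking enough to be worth a numerical check. With the hypothetical tables below, $a$ and $b$ are exactly independent when $c$ is marginalized out, but become dependent once $c$ is observed.

```python
# Numerical check of the head-to-head rule: a -> c <- b, binary variables.
# All distributions below are hypothetical.

p_a = {0: 0.6, 1: 0.4}
p_b = {0: 0.3, 1: 0.7}
p_c_ab = {(0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.5, 1: 0.5},
          (1, 0): {0: 0.4, 1: 0.6}, (1, 1): {0: 0.1, 1: 0.9}}

def joint(a, b, c):
    return p_a[a] * p_b[b] * p_c_ab[(a, b)][c]

# c unobserved: summing over c leaves p(a,b) = p(a) p(b), so a, b independent.
indep = all(
    abs(sum(joint(a, b, c) for c in (0, 1)) - p_a[a] * p_b[b]) < 1e-9
    for a in (0, 1) for b in (0, 1))

# c observed: p(a,b|c) no longer factorizes ("explaining away").
def p_c(c):
    return sum(joint(a, b, c) for a in (0, 1) for b in (0, 1))
def p_ab_c(a, b, c):
    return joint(a, b, c) / p_c(c)
def p_a_c(a, c):
    return sum(p_ab_c(a, b, c) for b in (0, 1))
def p_b_c(b, c):
    return sum(p_ab_c(a, b, c) for a in (0, 1))
dependent_given_c = abs(p_ab_c(0, 0, 0) - p_a_c(0, 0) * p_b_c(0, 0)) > 1e-6
```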

### Markov Blanket

In a large graphical model we will often find it of use to determine the minimal set of nodes on which a given node depends. The set of such nodes is rather colorfully known as a *Markov blanket*. Let us derive the Markov blanket of the node $x_i$ in Figure 9.

To examine which nodes are in the Markov blanket of $x_i$, we will look at the probability of $x_i$ conditioned on all other variables and see which nodes appear in that quantity. This set of nodes will be the Markov blanket of $x_i$.

Through use of the product rule of probability, the probability of $x_i$ conditioned on all other variables can be made to appear as a factor in the joint distribution. Explicitly, the product rule of probability implies that the joint distribution from Figure 9 can be written as

$$p(x_1, \ldots, x_M) = p(x_i|x_{-i})\,p(x_{-i})$$

where the nodes in Figure 9 are $x_1, \ldots, x_M$ and $x_{-i}$ denotes all of them except $x_i$. The sum rule of probability allows us to write the previous equation as

$$p(x_i|x_{-i}) = \frac{p(x_1, \ldots, x_M)}{\sum_{x_i} p(x_1, \ldots, x_M)}$$

The graphical model of Figure 9, however, allows us to write

$$p(x_1, \ldots, x_M) = \prod_k p(x_k|\mathrm{pa}_k)$$

where $\mathrm{pa}_k$ is the set of parent nodes of $x_k$. Thus, the last two equations imply

$$p(x_i|x_{-i}) = \frac{\prod_k p(x_k|\mathrm{pa}_k)}{\sum_{x_i} \prod_k p(x_k|\mathrm{pa}_k)}$$

We can actually simplify this seemingly complicated equation, and cancel equivalent factors in the numerator and denominator, by simply observing a few properties of the graph.

First, the summation in the denominator is over $x_i$. Therefore, any factors that are independent of $x_i$ can be pulled out of this sum and canceled with the corresponding factors in the numerator. Generically, a factor $p(x_k|\mathrm{pa}_k)$ will be independent of $x_i$ if $k \neq i$ and $x_i \notin \mathrm{pa}_k$. So, rather obviously, the factor $p(x_i|\mathrm{pa}_i)$ can not be pulled out from the sum, and any factor $p(x_k|\mathrm{pa}_k)$ can not be pulled out if $x_i \in \mathrm{pa}_k$. All other factors can be pulled out from the sum and canceled with the corresponding factor from the numerator.

So, the only factors remaining are the factor $p(x_i|\mathrm{pa}_i)$ and factors of the form $p(x_k|\mathrm{pa}_k)$ where $x_i \in \mathrm{pa}_k$. Thus, through $p(x_i|\mathrm{pa}_i)$ the random variable $x_i$ depends upon its parents $\mathrm{pa}_i$. Through the factors $p(x_k|\mathrm{pa}_k)$ where $x_i \in \mathrm{pa}_k$, the random variable $x_i$ depends upon all its children and all *co-parents* of its children, i.e. all parents of each $x_k$ other than $x_i$ itself. This set of all nodes upon which $x_i$ depends is its *Markov blanket* and is colored blue in Figure 9.
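The parents-children-co-parents recipe is mechanical enough to code directly. The sketch below computes a Markov blanket from a parent map; the graph used here is a small hypothetical example, not Figure 9 itself.

```python
# Sketch: compute the Markov blanket of a node in a DAG given as a parent map.
# The graph below is a hypothetical example.

parents = {                  # parents[node] = set of its parents
    "a": set(), "b": set(),
    "c": {"a", "b"},
    "d": {"c"},
    "e": {"c", "f"},
    "f": set(),
}

def markov_blanket(node):
    """Markov blanket = parents + children + co-parents of children."""
    children = {n for n, ps in parents.items() if node in ps}
    co_parents = (set().union(*(parents[ch] for ch in children)) - {node}
                  if children else set())
    return parents[node] | children | co_parents

blanket_c = markov_blanket("c")   # parents a, b; children d, e; co-parent f
```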

## Markov Random Fields

Up until now we have considered only graphs with directed edges. One could also consider undirected edges. However, it's not immediately apparent what a graph with undirected edges would correspond to. (Using undirected edges removes any notion of conditioning.) It turns out that one can actually make sense of undirected graphs, and doing so leads to a set of models called *Markov random fields*.

### Conditional Independence

As one will remember, the head-to-head rule was a bit confusing. It seemed more or less the exact opposite of the tail-to-tail and head-to-tail rules. One might hope that the complexity of the head-to-head rule could be removed if there were some way of "just making all of the dependency rules look the same." One way of doing this would be to use only undirected edges. Then, as an edge has no head or tail, the head-to-head, head-to-tail, and tail-to-tail rules would all "look the same." In fact this intuition is correct, as we will now see.

Consider the undirected graph of Figure 10. Assume we are given three sets of random variables denoted by $A$, $B$, and $C$, and asked to determine if $A$ and $B$ are conditionally independent given $C$. Doing so actually turns out to be easy.

One only needs to consider all paths from nodes in $A$ to nodes in $B$. If all such paths contain at least one node in $C$, then $A$ and $B$ are conditionally independent given $C$. However, if at least one such path does not contain a node in $C$, then $A$ and $B$ need not be conditionally independent. This is much simpler than the directed case.
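This path-blocking test is just graph reachability: remove the observed nodes and check whether $A$ can still reach $B$. A sketch, with a hypothetical four-node graph:

```python
# Sketch: undirected-graph separation test via breadth-first search.
from collections import deque

def separated(edges, A, B, C):
    """True if every path from a node in A to a node in B passes through C
    in the undirected graph given by `edges` (pairs of nodes)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    # BFS from A, never entering observed nodes in C.
    seen = set(A) - set(C)
    queue = deque(seen)
    while queue:
        u = queue.popleft()
        for v in adj.get(u, ()):
            if v not in seen and v not in C:
                seen.add(v)
                queue.append(v)
    return not (seen & set(B))

# Hypothetical graph: two parallel paths a-c-b and a-d-b.
edges = [("a", "c"), ("c", "b"), ("a", "d"), ("d", "b")]
```

Observing both `c` and `d` blocks every path from `a` to `b`; observing only `c` leaves the path through `d` open.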

### Markov Blanket

As a result of these conditional independence properties, determination of the *Markov blanket* for a node in an undirected graph is relatively simple. For example, in Figure 11 the *Markov blanket* of the middle node is the set of nodes that share an edge with the middle node.

More generally, the *Markov blanket* of any node $x$ in an undirected graph is the set of nodes that share an edge with $x$. Again, this is much simpler than the directed case.

### Joint Distribution

Now we understand how conditional independence works for undirected graphs. However, we still haven't examined how to write the joint distribution corresponding to an undirected graph. In this section we will examine how this is done.

Generally, an undirected graph will correspond to *some* factorization of the joint probability; the question is: which factorization? To answer this we will use what we already know about undirected graphs, namely how conditional independence works.

In an undirected graph consider two nodes $x_i$ and $x_j$ not connected by an edge. Now, for argument's sake, assume that all the other nodes are observed. Thus, as $x_i$ and $x_j$ are not connected by an edge, all paths from $x_i$ to $x_j$ must contain at least one observed node. Hence, conditioned on all the observed nodes, $x_i$ and $x_j$ are independent. Let us now see what this fact implies about the factorization of the joint distribution.

As we don't know how to factorize the joint distribution of an undirected graph, let's just guess an answer, then see what the guess implies. We will assume that the factorization of the joint distribution contains a factor of the form $\psi(x_i, x_j)$. Generically, such a factor implies that $x_i$ and $x_j$ are not conditionally independent when all other nodes are observed, as generically such a factor does not factor into two terms $\psi_1(x_i)\,\psi_2(x_j)$. However, conditioned on all the observed nodes, we have just found that $x_i$ and $x_j$ are independent. Thus, we know that such a factor can not appear in the factorization of the joint distribution. This leads to a general rule: a factor in a joint distribution contains only variables which are connected by an edge.

Though this general rule actually suffices for writing the joint distribution, we will find it convenient to introduce a few new concepts which simplify the process of writing the joint distribution. The first is a *clique*: a set of nodes such that each pair of nodes in the set is connected by an edge. For example, in Figure 12 the set of nodes $\{x_1, x_2\}$ forms a clique.

One can also define a *maximal clique*: a clique to which no additional node may be added while maintaining the clique property. For example, the set of nodes $\{x_1, x_2\}$ is a clique, but not a maximal clique, as one can add node $x_3$ to the set while maintaining the clique property. However, $\{x_1, x_2, x_3\}$ is a maximal clique, as one can not add $x_4$ to the set while maintaining the clique property, because $x_1$ and $x_4$ are not connected by an edge.

So, as factors in a joint distribution contain only variables connected by an edge, we can use the partitioning of an undirected graph into maximal cliques to guide us in writing the joint distribution. As each node in a maximal clique is connected by an edge to every other node in the maximal clique, our general rule implies that the associated variables should appear in the same factor in the joint distribution. So, if we define $\mathcal{C}$ as the set of maximal cliques, then the above allows us to write the joint distribution as

$$p(\mathbf{x}) = \frac{1}{Z} \prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$$

where the product is over all maximal cliques, $\mathbf{x}_C$ denotes the variables in clique $C$, and we have introduced

$$Z = \sum_{\mathbf{x}} \prod_{C \in \mathcal{C}} \psi_C(\mathbf{x}_C)$$

known as the *partition function*, to normalize the joint distribution.
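A tiny worked example makes the role of $Z$ concrete. Below, a three-node chain MRF is built from two hypothetical clique potentials, and the partition function is computed by brute-force summation.

```python
# Sketch: joint distribution of a small MRF as a normalized product of
# clique potentials. The potential tables are hypothetical.
from itertools import product

# Chain a - b - c with maximal cliques {a, b} and {b, c}.
psi_ab = {(0, 0): 1.0, (0, 1): 0.5, (1, 0): 0.5, (1, 1): 2.0}
psi_bc = {(0, 0): 2.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 3.0}

def unnormalized(a, b, c):
    return psi_ab[(a, b)] * psi_bc[(b, c)]

# Partition function: sum the product of potentials over all assignments.
Z = sum(unnormalized(a, b, c) for a, b, c in product((0, 1), repeat=3))

def p(a, b, c):
    return unnormalized(a, b, c) / Z

total = sum(p(a, b, c) for a, b, c in product((0, 1), repeat=3))
```

Note the potentials need not be probabilities, or even sum to anything in particular; $Z$ absorbs the overall scale.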

### Undirected vs Directed

So far we have examined directed graphs and undirected graphs. However, we have not examined the relation between a directed graph and an undirected graph. Here we will examine this relation.

Consider the directed graph in Figure 13a. From what we have seen, the joint distribution described by this directed graph is

$$p(\mathbf{x}) = p(x_1)\,p(x_2|x_1)\,p(x_3|x_2) \cdots p(x_N|x_{N-1})$$

As this is such a simple graph, we can guess that the corresponding undirected graph is obtained by making all directed edges undirected, Figure 13b.

The joint distribution described by the undirected graph in Figure 13b is given by

$$p(\mathbf{x}) = \frac{1}{Z}\,\psi_{1,2}(x_1, x_2)\,\psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N)$$

Now, from the two previous equations, we can guess the relation between the various factors in the directed joint distribution and the undirected joint distribution. They are

$$\psi_{1,2}(x_1, x_2) = p(x_1)\,p(x_2|x_1), \qquad \psi_{n-1,n}(x_{n-1}, x_n) = p(x_n|x_{n-1}) \text{ for } n > 2$$

Note that with these definitions of the $\psi$'s the partition function is identically $Z = 1$, as the various probabilities involved are already normalized.

Let us move on to more complicated models. Consider the directed graph of Figure 14a. The joint distribution it describes is

$$p(\mathbf{x}) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1, x_2, x_3)$$

Here our previous tactic of making all directed edges into undirected edges breaks down. For example, doing so would lead to an undirected graph in which the factors containing $x_4$ would be of the form

$$\psi_{1,4}(x_1, x_4)\,\psi_{2,4}(x_2, x_4)\,\psi_{3,4}(x_3, x_4)$$

Obviously this is not correct, as there is no way that the conditional dependence of $p(x_4|x_1, x_2, x_3)$ is captured by such factors. However, the fix is not too hard to come by.

From the previous example we see that we need to create an undirected graph such that $x_4$ and all of its parents appear together in a maximal clique, as it is only then that a factor containing $x_1$, $x_2$, $x_3$, and $x_4$ will appear in the factorization of the undirected graph. To do this one starts with Figure 14a, then adds undirected edges between all parents of $x_4$; finally, one makes all directed edges undirected. The resulting undirected graph is shown in Figure 14b.

This achieves our goal of putting $x_1$, $x_2$, $x_3$, and $x_4$ all in a single maximal clique. The joint distribution described by Figure 14b is

$$p(\mathbf{x}) = \frac{1}{Z}\,\psi_{1,2,3,4}(x_1, x_2, x_3, x_4)$$

This distribution, in contrast to our original guess, is actually correct, with $\psi_{1,2,3,4}(x_1, x_2, x_3, x_4) = p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1, x_2, x_3)$ and $Z = 1$. However, as one will note, some of the finer dependency information captured by the directed graph is not captured by the corresponding undirected graph; for example, the fact that the joint distribution factors into $p(x_1)\,p(x_2)\,p(x_3)\,p(x_4|x_1, x_2, x_3)$ is totally lost. So, this process of going from a directed graph to an undirected graph "destroys" information.

The general procedure for going from a directed graph to an undirected graph is only a slight generalization of what we did above. Given a directed graph, one first finds all nodes with more than one parent. For each such node one adds an undirected edge between all pairs of its parents. (This process of "marrying parents" goes by the name of *moralization*, an unfortunate nomenclature.) Finally, one makes all remaining directed edges undirected. As to the mapping from the conditional probabilities of the directed graph to the clique factors $\psi_C$, there will always be at least one valid mapping as a result of moralization.
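The moralization procedure is short enough to write out directly. The sketch below takes a DAG as a parent map, marries each node's parents, and drops edge directions; the example graph is the head-to-head structure discussed above.

```python
# Sketch: moralize a DAG given as parents[node] = set of parents.
from itertools import combinations

def moralize(parents):
    """Marry all pairs of parents of each node, then drop edge directions.
    Returns a set of undirected edges (frozensets of two nodes)."""
    edges = set()
    for node, ps in parents.items():
        for p in ps:                              # parent->child edges, undirected
            edges.add(frozenset((p, node)))
        for p, q in combinations(sorted(ps), 2):  # marry the parents
            edges.add(frozenset((p, q)))
    return edges

# The graph of Figure 14a's type: x1 -> x4 <- x2, x3 -> x4.
parents = {"x1": set(), "x2": set(), "x3": set(), "x4": {"x1", "x2", "x3"}}
moral = moralize(parents)   # all four nodes end up in a single clique
```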

## Inference in Graphical Models

So far we have examined Bayesian networks and Markov random fields. For both we have examined how to go from a graphical model to a joint distribution, and we have examined how conditional independence works in both cases. However, we have not examined how to determine marginal or conditional probabilities from the joint distribution. Here we'll examine how to obtain marginal and conditional probabilities from the joint distribution in an *efficient* manner for both Bayesian networks and Markov random fields.

### Inference on a Chain

We will start with the simplest case: determining the marginal distribution $p(x_n)$ of a single random variable $x_n$ in the joint distribution described by the graph in Figure 15. From the sum rule of probability we know

$$p(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} \sum_{x_{n+1}} \cdots \sum_{x_N} p(\mathbf{x})$$

From Figure 15 we know that the joint distribution is given by

$$p(\mathbf{x}) = \frac{1}{Z}\,\psi_{1,2}(x_1, x_2)\,\psi_{2,3}(x_2, x_3) \cdots \psi_{N-1,N}(x_{N-1}, x_N)$$

The last two equations thus imply

$$p(x_n) = \frac{1}{Z} \sum_{x_1} \cdots \sum_{x_{n-1}} \sum_{x_{n+1}} \cdots \sum_{x_N} \psi_{1,2}(x_1, x_2) \cdots \psi_{N-1,N}(x_{N-1}, x_N)$$

Generally this looks like a mess. However, a simple observation allows us to unravel this mess.

This simple observation is that multiplication distributes over addition:

$$ab + ac = a(b + c)$$

So, if we have a sum of various terms such that each term contains a common factor, then we can pull this common factor outside the summation, as in the simple example above. Consider now applying this observation to our rather messy equation above.

Consider the summation over $x_1$. The variable $x_1$ only appears in the single factor $\psi_{1,2}(x_1, x_2)$. Thus, all the other factors in the product of the $\psi$'s are common factors relative to this sum; so, we can pull them outside the summation over $x_1$. Consider next the summation over $x_2$. The variable $x_2$ appears in two terms: the summation mentioned above, through $\psi_{1,2}(x_1, x_2)$, and also the factor $\psi_{2,3}(x_2, x_3)$. All other factors in the product of the $\psi$'s are common factors relative to the sum over $x_2$; so, we can pull them outside the summation over $x_2$. This story continues with $x_3$, $x_4$, ... until one reaches $x_n$, which is not summed over. Then the entire process starts again, but with $x_N$ and $x_{N-1}$, from the other side.

Consider the summation over $x_N$. The variable $x_N$ only appears in the single factor $\psi_{N-1,N}(x_{N-1}, x_N)$. So, all other factors in the product of the $\psi$'s are common factors relative to this sum; thus, we can pull them outside the summation over $x_N$. Considering the summations over $x_{N-1}$, $x_{N-2}$, ... we see that we can play the same game. So, using all of these facts together, we see that we can write the marginal distribution over $x_n$ as follows

$$p(x_n) = \frac{1}{Z} \left[ \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[ \sum_{x_1} \psi_{1,2}(x_1, x_2) \right] \cdots \right] \left[ \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \left[ \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \right] \cdots \right]$$

This is simpler, but still somewhat of a mess. We can clean it up using the following definitions

$$\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n) \cdots \left[ \sum_{x_1} \psi_{1,2}(x_1, x_2) \right] \cdots$$

$$\mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1}) \cdots \left[ \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N) \right] \cdots$$

These allow us to write $p(x_n)$ in the simple form

$$p(x_n) = \frac{1}{Z}\,\mu_\alpha(x_n)\,\mu_\beta(x_n)$$

This is much better.

Beyond allowing us to express $p(x_n)$ in a compact form, the factors $\mu_\alpha(x_n)$ and $\mu_\beta(x_n)$ allow us to interpret the calculation of $p(x_n)$ in a very fruitful manner. Staring at the definition of $\mu_\alpha(x_n)$, one sees that it can be written recursively

$$\mu_\alpha(x_n) = \sum_{x_{n-1}} \psi_{n-1,n}(x_{n-1}, x_n)\,\mu_\alpha(x_{n-1})$$

This suggests that we can interpret $\mu_\alpha(x_n)$ as a message being passed from lower numbered nodes to higher numbered nodes, where at each node the equation above is applied to obtain the subsequent $\mu_\alpha$. The initial message in this chain is then given by

$$\mu_\alpha(x_2) = \sum_{x_1} \psi_{1,2}(x_1, x_2)$$

Similarly, $\mu_\beta(x_n)$ also admits a recursive definition

$$\mu_\beta(x_n) = \sum_{x_{n+1}} \psi_{n,n+1}(x_n, x_{n+1})\,\mu_\beta(x_{n+1})$$

This suggests that we can interpret $\mu_\beta(x_n)$ as a message being passed from higher numbered nodes to lower numbered nodes, where at each node the above equation is applied to obtain the subsequent $\mu_\beta$. In this chain the initial message is given by

$$\mu_\beta(x_{N-1}) = \sum_{x_N} \psi_{N-1,N}(x_{N-1}, x_N)$$

So, in summary, the recursive message passing of $\mu_\alpha$ and $\mu_\beta$ along this chain can be depicted as in Figure 16.

So we have found that, for the simple case of the undirected graph of Figure 15, we can now marginalize to obtain the distribution $p(x_n)$ for any node $x_n$ in the graph. Next we will see how to extend this result to an arbitrary tree, i.e. a graph in which there is one and only one path between any pair of nodes. However, before we are able to do so, we must take a detour and examine so-called "factor graphs."
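The chain algorithm above translates almost line for line into code. The sketch below passes $\mu_\alpha$ messages left to right and $\mu_\beta$ messages right to left over a chain of binary variables with hypothetical, symmetric potentials, then normalizes the product (the normalization plays the role of $1/Z$).

```python
# Sketch of chain message passing: mu_alpha flows left-to-right, mu_beta
# right-to-left, and p(x_n) is proportional to mu_alpha(x_n) * mu_beta(x_n).
# The potentials are hypothetical; each variable is binary.

N = 5
# psi[k][(x_k, x_{k+1})] is the potential linking node k and node k+1 (0-indexed).
psi = [{(i, j): 1.0 + 0.5 * (i == j) for i in (0, 1) for j in (0, 1)}
       for _ in range(N - 1)]

def marginal(n):
    """Marginal of node n (0-indexed) via the two recursions in the text."""
    alpha = {x: 1.0 for x in (0, 1)}          # message arriving from the left
    for k in range(n):
        alpha = {x: sum(psi[k][(y, x)] * alpha[y] for y in (0, 1))
                 for x in (0, 1)}
    beta = {x: 1.0 for x in (0, 1)}           # message arriving from the right
    for k in range(N - 2, n - 1, -1):
        beta = {x: sum(psi[k][(x, y)] * beta[y] for y in (0, 1))
                for x in (0, 1)}
    unnorm = {x: alpha[x] * beta[x] for x in (0, 1)}
    Z = sum(unnorm.values())
    return {x: v / Z for x, v in unnorm.items()}

m2 = marginal(2)   # symmetric potentials give a uniform marginal here
```

Each message update is a sum over one variable, so the total cost is linear in the chain length, instead of the exponential cost of summing the joint directly.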

### Factor Graphs

Both the directed graphs of Bayesian networks and the undirected graphs of Markov random fields are used to express joint probabilities as products over various factors. Generically, both types of graphs yield a joint distribution of the general form

$$p(\mathbf{x}) = \prod_s f_s(\mathbf{x}_s)$$

where each $\mathbf{x}_s$ is a subset of the variables in $\mathbf{x}$. A "factor graph" is simply another type of graph which expresses such a factorization.

Explicitly, a *factor graph* is a graph with a node, depicted using a circle, for every variable in the distribution *and* with a node, depicted using a square, for every factor in the distribution. It also contains undirected links which link each *factor node* to the variable nodes on which the factor depends. So, for example, the factor node corresponding to $f_s$ will have undirected links to all the variable nodes corresponding to variables in the set $\mathbf{x}_s$.

As an example, consider the joint distribution

$$p(x_1, x_2, x_3) = f_a(x_1, x_2)\,f_b(x_1, x_2)\,f_c(x_2, x_3)\,f_d(x_3)$$

The factor graph corresponding to this joint distribution is depicted in Figure 17.

Figure 17 has a circular node for each of the variables $x_1$, $x_2$, and $x_3$, and it has a square node for each of the factors $f_a$, $f_b$, $f_c$, and $f_d$. In addition, it has undirected edges connecting each factor node to the variable nodes on which the factor depends, e.g. the factor node of $f_a$ has undirected edges to the variable nodes for $x_1$ and $x_2$.

Given the undirected graph of a Markov random field, we can produce a factor graph that represents the same distribution. One first introduces a variable node for each variable node of the undirected graph, then for each clique of the undirected graph one introduces a factor node. Finally, one connects each factor node to the variable nodes in its clique. The factors are then simply set equal to the corresponding clique potentials. For example, using these rules the undirected graph of Figure 18a is converted to the factor graph of Figure 18b. However, one should also note that factor graphs are more "flexible" than the undirected graphs of a Markov random field. For example, the undirected graph of Figure 18a could also correspond to the factor graph of Figure 18c, though not via the rules described above.

Given the directed graph of a Bayesian network, we can also produce a factor graph that represents the same distribution. First, one creates a variable node for each variable node of the directed graph, then one introduces a factor node for each of the directed graph's conditional distributions. Finally, one adds undirected links linking each factor node to the variable nodes in the factor node's conditional distribution. The factors are then simply set equal to the corresponding conditional distributions. For example, using these rules the directed graph of Figure 19a results in the factor graph of Figure 19b. As in the case of undirected graphs, there may be more than one factor graph corresponding to a given directed graph. For example, the directed graph of Figure 19a also corresponds to the factor graph of Figure 19c, though not via the rules above.

### Sum Product Algorithm

Next we will examine how to employ factor graphs to derive marginal distributions from directed or undirected tree-structured graphs. We will derive the so-called "sum-product algorithm," which computes such marginal distributions.

If we are given a directed or undirected tree-structured graph describing a joint distribution then we can, using the rules we established in the previous section, convert the graph into a factor graph that describes the same joint distribution. Using this factor graph, we can then write this joint distribution as a product of various factors, each of which corresponds to a factor node.

Now, one can easily prove that, as we started with a tree-structured graph, our factor graph is also tree-structured. Thus, each factor node $f_s$ attached to a variable node, say $x$, roots a subtree of our original factor graph. So, if we denote the product of all factors in the subtree of $f_s$ as $F_s(x, X_s)$, where $X_s$ denotes all variables other than $x$ in this subtree, then we can write the joint distribution as follows

$$p(\mathbf{x}) = \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$$

where $\mathrm{ne}(x)$ is the set of all factor nodes that are neighbors of $x$.

Now, if we wish to compute the marginal distribution $p(x)$ from the joint distribution, then we can simply employ the sum rule of probability

$$p(x) = \sum_{\mathbf{x} \setminus x} p(\mathbf{x})$$

However, if we use the joint distribution as derived from our analysis of the factor graph, the marginal distribution can be written as follows

$$p(x) = \sum_{\mathbf{x} \setminus x} \prod_{s \in \mathrm{ne}(x)} F_s(x, X_s)$$

As the sum is over all variables other than $x$, the sum is really over all of the various $X_s$, as these contain all variables in the factor graph that are *not* $x$. So, we can interchange the sum and product, as we did to derive the marginals for Figure 15, to obtain the following form of the marginal distribution

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \left[ \sum_{X_s} F_s(x, X_s) \right]$$

where each sum is now taken over all variables in the subtree rooted at $f_s$ that are *not* $x$. If we define $\mu_{f_s \to x}(x)$ through

$$\mu_{f_s \to x}(x) \equiv \sum_{X_s} F_s(x, X_s)$$

then we can write the marginal distribution as follows

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

This allows us to view the marginal distribution as resulting from the product of all messages $\mu_{f_s \to x}(x)$ that "flow" to $x$ from the factor nodes connected to $x$ by an undirected edge. This situation is depicted in Figure 20.

Next let us consider in a bit more detail the form of $F_s(x, X_s)$. As previously mentioned, $F_s(x, X_s)$ is the product of all factors in the subtree rooted by $f_s$. However, as this subtree is also a tree, we can use this tree structure to decompose $F_s(x, X_s)$ into its factors. In particular, as $f_s$ roots its subtree, $f_s$ itself is a factor of $F_s(x, X_s)$. Furthermore, as our graph is tree-structured, each variable node $x_m$ attached to the factor node $f_s$ roots a subtree in our factor graph. Thus, $F_s(x, X_s)$ contains a factor $G_m(x_m, X_{sm})$, the product of all factors in the subtree rooted at $x_m$, for each such variable node. (Of course there is no such factor for $x$ itself, as $x$ has already been accounted for.) With this, we can write $F_s(x, X_s)$ as follows

$$F_s(x, X_s) = f_s(x, x_1, \ldots, x_M)\,G_1(x_1, X_{s1}) \cdots G_M(x_M, X_{sM})$$

where the variable nodes, other than $x$, attached to $f_s$ are denoted by $x_1, \ldots, x_M$. This expression for $F_s(x, X_s)$ allows us to write $\mu_{f_s \to x}(x)$ in a more informative manner

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \left[ \sum_{X_{sm}} G_m(x_m, X_{sm}) \right]$$

where $\mathrm{ne}(f_s) \setminus x$ is the set of all variable nodes that are neighbors of $f_s$, other than $x$, and we have, for the same reasons as before, exchanged the sum and product. Now, if we define $\mu_{x_m \to f_s}(x_m)$ as follows

$$\mu_{x_m \to f_s}(x_m) \equiv \sum_{X_{sm}} G_m(x_m, X_{sm})$$

then we can rewrite $\mu_{f_s \to x}(x)$ more informatively as

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

With this we can view $\mu_{f_s \to x}(x)$ as the product of various messages $\mu_{x_m \to f_s}(x_m)$ from the variable nodes $x_m$ to $f_s$, along with the factor $f_s$ itself. Finally, all of this is "marginalized" by summing over $x_1, \ldots, x_M$. This is depicted in Figure 21.

Now let us consider the factor $G_m(x_m, X_{sm})$ in a bit more detail. By definition $G_m(x_m, X_{sm})$ is the product of all factors in the subtree rooted by $x_m$. The subtree rooted by $x_m$ consists of "sub-subtrees", each of which is rooted by a factor node $f_l$ which is connected directly to $x_m$ by an edge. This situation is depicted in Figure 22.

From our previous calculations we know that each of these "sub-subtrees" produces a factor $F_l(x_m, X_{ml})$. Thus, $G_m(x_m, X_{sm})$ can be expressed as follows

$$G_m(x_m, X_{sm}) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} F_l(x_m, X_{ml})$$

This allows us to express $\mu_{x_m \to f_s}(x_m)$ as follows

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \left[ \sum_{X_{ml}} F_l(x_m, X_{ml}) \right]$$

where we have, for the reasons outlined previously, exchanged the sum and product operations. Using the definition of $\mu_{f_l \to x_m}(x_m)$, the previous equation can be written as

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

This can be interpreted as stating that the message the variable node $x_m$ sends to the factor node $f_s$ is the product of all the messages $x_m$ receives from all other adjoining factor nodes.

With this last result in concert with our previous results, we now know how to pass messages along our factor graph, as we know both how to handle these messages at variable nodes and at factor nodes. The only remaining point to clarify is how the message passing process starts.

As the messages are all eventually passed to the root node, the node we want the marginal distribution of, the obvious place the messages have to start is at the leaves of the tree. The leaves of this tree can either be variable nodes or factor nodes, so we have to handle both cases. If a leaf is a variable node $x$, then the message it passes to its parent factor node $f$ is simply

$$\mu_{x \to f}(x) = 1$$

This is depicted in Figure 23a.

For the case of a leaf factor node $f$, the message it passes to its parent variable node $x$ is simply

$$\mu_{f \to x}(x) = f(x)$$

This is depicted in Figure 23b.

So, in summary, to evaluate the marginal distribution $p(x)$ of a variable $x$ described by a directed or undirected tree-structured graph, we first use the procedure we established previously to convert the graph into a factor graph. Then we can view the factor graph as a tree rooted at the variable node $x$. Starting at the leaves of this tree we pass leaf messages, either $\mu_{x \to f}(x) = 1$ or $\mu_{f \to x}(x) = f(x)$, from the leaves to their parents. We then use the two methods we previously derived to pass messages across factor nodes

$$\mu_{f_s \to x}(x) = \sum_{x_1} \cdots \sum_{x_M} f_s(x, x_1, \ldots, x_M) \prod_{m \in \mathrm{ne}(f_s) \setminus x} \mu_{x_m \to f_s}(x_m)$$

and across variable nodes

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathrm{ne}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m)$$

to propagate the messages from the leaves to the root $x$. At the root node we then employ

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

to determine the final marginal distribution.

Now, if we wanted to compute the marginal distributions for two variable nodes $x_i$ and $x_j$, then we could simply use the algorithm above two times, once with $x_i$ as root and once with $x_j$ as root. However, there is a more efficient algorithm that's much simpler. One can, just as we did above, view $x_i$ as a root and pass messages from the leaves up to the root; this will suffice to compute the marginal $p(x_i)$. However, one can now pass messages in the other direction, from the root to the leaves. Upon all the leaves receiving their messages, each variable node $x$ will have obtained messages of the form $\mu_{f_s \to x}(x)$ from all neighboring factor nodes. This allows us to calculate the marginal of any variable as follows

$$p(x) = \prod_{s \in \mathrm{ne}(x)} \mu_{f_s \to x}(x)$$

So, with only twice the computation involved in computing a single marginal, we can compute $p(x)$ for every variable in the factor graph!

Finally, we can also use the above construction to compute conditional distributions. Assume the variables of the factor graph are partitioned into observed variables $\mathbf{v}$ and a latent variable $h$, and we want to find the distribution of $h$ conditioned on the observations, in other words $p(h|\mathbf{v} = \hat{\mathbf{v}})$, where the observed value of $\mathbf{v}$ is $\hat{\mathbf{v}}$. To do so, we first examine all the factors in the factor graph. If a factor is a function of an observed variable $v_i$, then we simply multiply the factor by $I(v_i, \hat{v}_i)$, which is $1$ if $v_i = \hat{v}_i$ and $0$ otherwise. After updating the factors for all the observed variables, one simply runs the above algorithm. Instead of computing $p(h)$, it now computes $p(h, \mathbf{v} = \hat{\mathbf{v}})$, which equals $p(h|\mathbf{v} = \hat{\mathbf{v}})$ up to normalization, due to the introduction of the $I$ factors.
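This "clamping" trick is easy to demonstrate on a tiny example. The sketch below uses a single hypothetical factor over two binary variables, multiplies it by the indicator for an observation, and renormalizes to obtain the conditional.

```python
# Sketch of conditioning by clamping: multiply every factor involving an
# observed variable by the indicator I(x, x_hat), then renormalize.
# The factor table below is hypothetical.

# Joint over (a, b) given by a single factor table.
f = {(0, 0): 0.3, (0, 1): 0.1, (1, 0): 0.2, (1, 1): 0.4}

def clamp(factor, var_index, observed):
    """Multiply a factor by the indicator that variable var_index == observed."""
    return {xs: v * (1.0 if xs[var_index] == observed else 0.0)
            for xs, v in factor.items()}

# Condition on b = 1 and marginalize to get p(a | b = 1).
g = clamp(f, 1, 1)
Z = sum(g.values())                       # equals p(b = 1) here
p_a_given_b1 = {a: sum(g[(a, b)] for b in (0, 1)) / Z for a in (0, 1)}
```

In a full factor graph the clamped factors simply replace the originals, and the unchanged sum-product passes deliver the conditional marginals.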