Graphical Models: Theory

August 24th, 2013
|

Here at forty.to we are starting to employ "graphical models" into our tool chain. So, to spread the gospel, and hopefully help anyone deciding to get into "graphical models", we decided to write a two part blog post on "graphical models". This first part covers the basics of "graphical model" theory. The second part will cover the practice of using "graphical models".

As we want to keep the post to a manageable length we, unfortunately, have to make some assumptions of the reader. We assume the reader is familiar with calculus, probability distributions, conditional probability distributions, and the sum and product rules of probability. One doesn't have to have a very deep understanding of any of these topics; a cursory understanding should suffice.

So, with these preliminaries out of the way, lets jump in!

Why Graphical Models?

Generally humans find reasoning about and manipulating graphical representations much easier than reasoning about and manipulating equations. So, anytime one can convert manipulations of equations to manipulations of graphical diagrams, it's a win.

One need not look far for examples of just this logic. Feynman diagrams

Feynman Diagram

revolutionized physics, allowing physicists to reason about and manipulate these graphical representations instead of equations. Knot diagrams

Knot Diagrams

allowed mathematicians to reason about knots through simple manipulation of  these funny little pictures. More exotically, Kirby Diagrams

Kirby Diagram

allow mathematicians to reason about four dimensional spaces through manipulation of diagrams like that above.

So, it seems only logical to look for a graphical representation of a probability distribution that allows one to easily manipulate and reason about that distribution. Graphical models provide just such a representation.

Bayesian Networks

Consider a probability distribution $p(a,b,c)$ over three random variables $a,$ $b,$ and $c$. Using the product rule of probability twice, we can rewrite $p(a,b,c)$ as follows

Now we will introduce a graphic form for the right hand side of this equation.

For each random variable we introduce a node and we associate each node with the conditional probability for its variable. Then for each conditional probability, i.e. node, and for each corresponding conditioning variable we add a directed edge from the node corresponding to the conditioning variable to the node corresponding to the conditioned variable.

So, the right hand side of the equation is graphically represented as in Figure 1. (Note Figure 1, and all subsequent figures, were created by Christopher M. Bishop.)

Figure 1

The distribution $p(a)$ is not conditioned; thus, there are no directed edges going in to node $a$. The distribution $p(b\mid a)$ is conditioned on $a$; thus, there is a directed edge from node $a$ to node $b$. The distribution $p(c\mid a, b)$ is conditioned on $a$ and $b$; thus, there is a directed edge from node $a$ to node $c$ and a directed edge from node $b$ to node $c$.

So, as the rules for this game are relatively simple, we can explore more complicated diagrams. For example the distribution described by Figure 2

Figure 2

has the form

One can form an infinite number of such graphs, nodes representing random variables and directed edges conditioning. Generically such a graph goes under the name of a Bayesian network, and in the following sections we will study Bayesian networks in some detail.

Conditional Independence

Beyond just looking pretty these graphical models also help one reason about "independence" and "conditional independence".

As you will recall, two random variables $a$ and $b$ are independent if the following equation holds

This notion of independence can be generalized to the case of conditional distributions as well. Given the random variable $c$ two random variables $a$ and $b$ are conditionally independent if the following equation holds

Often one is given a joint probability distribution and one needs to determine if two, or more, random variables in the joint distribution are independent or conditional independent. Given a graphical model for the distribution, this question can be answered relatively easily, as we will show below.

Tail to Tail

First consider the probability distribution depicted by the graphical model of Figure 3.

Figure 3

Using what we have discussed previously, it has the form

Let us assume that $c$ is not observed and, in the context of this assumption, examine whether $a$ and $b$ are independent. Using the sum rule of probability we can write the distribution $p(a,b)$ as

Substituting the form of the distribution $p(a,b,c)$ gleaned from the graphical model, we have

So, generically this distribution does not factor into a product of $p(a)$ and $p(b)$. Hence, $a$ and $b$ are not generically independent.

Consider now the case in which $c$ is observed. In our graphical model we will denote this by coloring the node $c$ in, as in Figure 4.

Figure 4

Again, using what we have learned about graphical models, the joint distribution is

In addition, the product rule of probability implies

Together these last two equations imply

where we have used the fact that $p(c) \neq 0$, as $c$ was observed, to divide through by $p(c)$. This last equation implies that if $c$ is observed, then $a$ and $b$ are conditionally independent.

Generally, the situation depicted in Figure 4 gives rise to the tail-to-tail rule of conditional independence. It states that if an observed node is connected to the tails of two directed edges, e.g. $c$ in Figure 4, then the two nodes at directed edges' heads, e.g. $a$ and $b$ in Figure 4, are independent. This rule simply formalizes what we discovered above.

Consider the probability distribution depicted by the graphical model of Figure 5.

Figure 5

Its joint probability distribution can be written as follows

Let us assume that $c$ is not observed and, with this assumption, examine whether $a$ and $b$ are independent. Using the sum rule of probability we can write the distribution $p(a,b)$ as follows

Using the product rule we have $p(c \mid a) p(b \mid c)=p(b,c \mid a)$. The sum rule then implies that the above equation can be written as

Generically this implies that $a$ and $b$ are not independent if $c$ is not observed.

Consider now the case in which $c$ is observed; this is depicted in Figure 6.

Figure 6

Again, from our study of graphical models, we know the joint probability distribution can be written as follows

The product rule of probability implies

So, together these imply

where we have used the fact that $p(c) \neq 0$ as $c$ has been observed. However, Bayes Rule states that

So, together that last two equations imply

This simply states that if $c$ is observed, then $a$ and $b$ are conditionally independent.

The situation depicted in Figure 6 is that of the head-to-tail rule of conditional independence. If an observed node is connected to the head of one directed edge and the tail of another, e.g. $c$ in Figure 6, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 6, are independent.

As a mnemonic for the tail-to-tail rule and the head-to-tail rule, one can think of "influence" as "flowing" from  node $a$ to node $b$ and from node $b$ to node $a$ along the various edges that pass through $c.$ When $c$ is observed, and subsequently its node is colored in, this "influence" is "blocked" and thus $a$ and $b$ are independent as a result.

The tail-to-tail rule and the head-to-tail rule were relatively simple. We will now examine a more complicated rule which is derived from the distribution described in Figure 7.

Figure 7

The joint distribution described by Figure 7 is given by

We can use the sum rule of probability to determine the distribution $p(a,b)$. It is given by

where for the last equality we have used the fact that $p(c \mid a, b)$ is a probability distribution and thus the sum over $c$ is $1.$ Obviously, this implies that $a$ and $b$ are independent. Note, this is exactly opposite of the tail-to-tail and head-to-tail cases where, if $c$ was not observed, $a$ and $b$ were not independent.

Consider now the case in which $c$ is observed. This corresponds to Figure 8.

Figure 8

The joint distribution described by Figure 8 is given by

The product rule of probability implies

Hence, together these imply

where we have used the fact that $p(c) \neq 0$ which follows from the fact that $c$ was observed. Now, generically this does not imply that $a$ and $b$ are conditionally independent given $c$.

The situation depicted in Figure 8 is the head-to-head rule of conditional independence. If an observed node is connected to the head of two directed edges, e.g. $c$ in Figure 8, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 8, are dependent. So, the situation here is exactly reversed from the tail-to-tail and head-to-tail cases.

Actually the head-to-head rule is more complicated than we have presented it here. It has an addendum. The full head-to-head rule of conditional independence states: If node is connected to the head of two directed edges, e.g. $c$ in Figure 8, and it or any of its descendants are observed, then the two nodes at the edges' other ends, e.g. $a$ and $b$ in Figure 8, are dependent.

So, in summary, the tail-to-tail rule states that the two nodes $a$ and $b$ in Figure 3 are dependent unless $c$ is observed in which case they become independent. The head-to-tail rule states that the two nodes $a$ and $b$ in Figure 5 are dependent unless $c$ is observed in which case they become independent. Contrastingly, the head-to-head rule says the the two nodes $a$ and $b$ in Figure 7 are independent unless $c$, or any of its descendants, are observed in which case $a$ and $b$ become dependent. Generally if two, or more, nodes are conditionally independent according to these rules, the nodes are said to be d-seperated.

Markov Blanket

In a large graphical model we will often find it of use to determine the minimal set of nodes on which a given node depends. The set of such nodes is rather colorfully know as a Markov blanket. Let us derive the Markov blanket of the node $x_i$ in Figure 9.

Figure 9

To examine which nodes are in the Markov blanket of $x_i,$ we will look at the probability of $x_i$ conditioned on all other  variables and see what nodes appear in that quantity. This set of nodes will be the Markov blanket of $x_i.$

Through use of the product rule of probability the probability of $x_i$ conditioned on all other  variables can be made to appear as a factor in the joint distribution. Explicitly, the product rule of probability implies that the joint distribution from Figure 9 can be written as

where the nodes in Figure 9 are $x_1, \ldots, x_n.$ The sum rule of probability allows us to write the previous equation as

The graphical model of Figure 9, however, allows us to write

where $pa_k$ is the set of parent nodes of $x_k$. Thus, the last two equations imply

We can actually simplify this seemingly complicated equation, and cancel equivalent factors in the numerator and denominator, by simply observing a few properties of the graph.

First, the summation in the denominator is over $x_i$. Therefore, any terms that are independent of $x_i$ can be pulled out of this sum and canceled with the corresponding terms in the numerator. Generically, a term $p(x_k \mid pa_k)$ will be independent of $x_i$ if $i \neq k$ and $x_i\not\in pk_k.$ So, rather obviously, the term $p(x_i \mid pk_i)$  can not be pulled out from the sum and any term $p(x_k \mid pk_k)$ can not be pulled out if $x_i \in pk_k.$ All other terms can be pulled out from the sum and canceled with the corresponding term from the numerator.

So, the only terms remaining are terms of the form $p(x_i \mid pk_i)$ and terms of the form $p(x_k \mid pk_k)$ where $x_i \in pk_k.$ Thus, through $p(x_i \mid pk_i)$ the random variable $x_i$ depends upon its parents $pk_i.$ Through $p(x_k \mid pk_k)$ where $x_i \in pk_k$ the random variable $x_i$ depends upon all its children and all co-parents of its children, i.e. all parents of $x_k$ other than $x_i$. This set of all nodes upon which $x_i$ depends is its Markov blanket and is colored blue in Figure 9.

Markov Random Fields

Up until now we have considered only graphs with directed edges. One could also consider undirected edges too. However, it's not immediately apparent what a graph with undirected edges would correspond to. (Using undirected edges removes any notion of conditioning.) It turns out that one can actually make sense of undirected graphs, and doing so leads to a set of models called Markov random fields.

Conditional Independence

As one will remember, the head-to-head rule was a bit confusing. It seemed more or less the exact opposite of the tail-to-tail and head-to-tail rule. One might hope that the complexity of this head-to-head rule could be simplified if there was some way of "just making all of the dependency rules look the same." One way of doing this would be to use only undirected edges. Then, as there is no head or tail of an edge, all of the rules head-to-head, head-to-tail, and tail-to-tail will "look the same." In fact this intuition actually is correct, as we will now see.

Consider the undirected graph of Figure 10. Assume we are given three sets of random variables denoted by $A,$ $B,$ and $C$ and asked to determine if $A$ and $B$ are conditionally independent given $C.$ Doing so actually turns out to be easy.

Figure 10

One only needs to consider all paths from nodes in $A$ to nodes in $B$. If all such paths contain at least one node in $C$, then $A$ and $B$ are conditionally independent given $C.$ However, if at least one such path does not contain a node in $C,$ then $A$ and $B$ need not be conditionally independent. This is much simpler than the directed case.

Markov Blanket

As a result of these conditional independence properties, determination of the Markov Blanket for a node in an undirected graph is relatively simple. For example in Figure 11

Figure 11

the Markov Blanket of the middle node is the set of nodes that share an edge with the middle node.

More generally, the Markov Blanket of  any node $x_i$ in an undirected graph is the set of nodes that share an edge with $x_i.$ Again, this is much simpler than the directed case.

Joint Distribution

Now we understand how conditional independence works for undirected graphs. However, we still haven't examined how to write the joint distribution corresponding to an undirected graph. In this section we will examine how this is done.

Generally, an undirected graph will correspond to some factorization of the joint probability; the question is: Which factorization? To answer this we will use what we already know about undirected graphs, how conditional independence works.

In an undirected graph consider two nodes $x_i$ and $x_j$ not connected by an edge. Now, for arguments sake, assume that all the other nodes $\{x_1, \ldots, x_{i-1},x_{i+1},\ldots, x_{j-1},x_{j+1}, \ldots, x_n\}$ are observed. Thus, as $x_i$ and $x_j$ are not connected by an edge, all paths from $x_i$ to $x_j$ must contain at least one observed node. Hence, conditioned on all the observed nodes $x_i$ and $x_j$ are independent. Let us now see what this fact implies about the factorization of the joint distribution.

As we don't know how to factorize the joint distribution of an undirected graph, let's just guess an answer, then see what the guess implies. We will assume that the factorization of the joint distribution contains a factor of the form $\psi(x_i,x_j,\ldots)$. Generically, such a term implies that $x_i$ and $x_j$ are not conditionally independent when all other nodes are observed, as generically such a term does not factor into two terms $\psi(x_i,\ldots)\psi(x_j,\ldots).$ However, conditioned on all the observed nodes we have just found that $x_i$ and $x_j$ are independent. Thus, we know that such a term $\psi(x_i,x_j,\ldots)$ can not appear in the factorization of the joint distribution. This leads to a general rule: A factor in a joint distribution contains only variables which are connected by an edge.

Though this general rule actually suffices for writing the joint distribution, we will find it convenient to introduce a few new concepts which simplify the process of writing the joint distribution. The first is a clique, a  clique is a set of nodes such that each pair of nodes in the set is connected by a edge. For example, in Figure 12 the set of nodes $\{x_3, x_4\}$ forms a clique.

Figure 12

One can also define a maximal clique, a maximal clique is a clique in which no additional elements may be added while maintaining the clique property. For example, the set of nodes $\{x_3, x_4\}$ is a clique, but not a maximal clique, as one can add node $x_2$ to the set while maintaining the clique property. However, $\{x_2, x_3, x_4\}$ is a maximal clique, as one can not add $x_1$ to the set while maintaining the clique property as $x_1$ and $x_4$ are not connected by a edge.

So, as factors in a joint distribution contain only variables connected by an edge, we can use the partitioning of an undirected graph into maximal cliques to guide us in writing the joint distribution. As each node in a maximal clique is connected by an edge to any other element in the maximal clique, our general rule implies that the associated variables should appear in the same factor in the joint distribution. So, if we define $MC$ as the set of maximal cliques, then the above allows us to write the joint distribution as

where the product is over all maximal cliques $C$ and we have introduced $Z,$ known as the partition function, to normalize the joint distribution.

Undirected vs Directed

So far we have examined directed graphs and undirected graphs. However, we have not examined the relation between a directed graph and an undirected graph. Here we will examine this relation.

Consider the directed graph in Figure 13a. From what we have seen, the joint distribution described by this directed graph is

As this is such a simple graph we can guess that the corresponding undirected graph is obtained by making all directed edges undirected, Figure 13b.

Figure 13a

Figure 13b

The joint distribution described by the undirected graph in Figure 13b is given by

Now, from the two previous equations, we can guess the relation between the various factors in the directed joint distribution and the undirected joint distribution. They are

Note that with these definitions of $\psi_C(C)$ the partition function $Z$ is identically $1$, as the various probabilities involved are already normalized.

Let us move on to more complicated models. Consider the directed graph of Figure 14a. The joint distribution it describes is

Here our previous tactic of making all directed edges into undirected edges breaks down. For example, doing so would lead to an undirected graph in which all factors containing $x_4$ would be

Obviously this is not correct as there is no way that the conditional dependence of $p(x_4 \mid x_1,x_2,x_3)$ is captured from such terms. However, the fix is not too hard to come by.

Figure 14a

From the previous example we see that we need to create an undirected graph such that all of the parents of $x_4$ appear together in a maximal clique, as it is only then that a term containing $x_1,$  $x_2,$ $x_3,$ and $x_4$ will appear in the factorization of the undirected graph. To do this one starts with Figure 14a, then adds undirected edges between all parents of $x_4;$ finally one makes all directed edges undirected. The resulting undirected graph is shown in Figure 14b.

Figure 14b

This achieves our goal of putting $x_1,$  $x_2,$ $x_3,$ and $x_4$ all in a single maximal clique. The joint distribution described by Figure 14b is

This distribution, in contrast to our original guess, is actually correct. However, as one will note, some of the finer dependency information captured by the directed graph is not captured by the corresponding undirected graph; for example, the fact that the joint distribution has a $p(x_1)$ factor is totally lost. So, this process of going from a directed graph to an undirected graph "destroys" information.

The general procedure of going from a directed graph to an undirected is only a slight generalization of what we did above. Given a directed graph one first finds all nodes with more than one parent. For each such node one one adds an undirected edge between all pairs of its parents. (This process of  "marrying parents"  goes by the name of moralization, an unfortunate nomenclature.)  Finally, one makes all remaining directed edges undirected. As to the mapping from conditional probabilities of the directed graph to the factors $\psi_C(C)$, there will always be at least one valid mapping as a result of moralization.

Inference in Graphical Models

So far we have examined Bayesian networks and Markov random fields. For both we have examined how to go from a graphical model to a joint distribution, and we have examined how conditional independence works in both cases. However, we have not examined how to determine marginal or conditional probabilities from the joint distribution. Here we'll examine how to obtain marginal and conditional probabilities from the joint distribution in an efficient manner for both Bayesian networks and Markov random fields.

Inference on a Chain

We will start with the simplest case, that of determining the marginal distribution of a single random variable $x_n$ in the joint distribution described by the graph in Figure 15. From the sum rule of probability we know

From Figure 15

Figure 15

we know that the joint distribution is given by

The last two equations thus imply

Generally this looks like a mess. However, a simple observation allows us to unravel this mess.

This simple observation is that multiplication distributes over addition

So, if we have a sum of various terms such that each term contains a common factor, then we can pull this common factor outside the summation, as in the above simple example. Consider now applying this simple observation to our rather messy equation above.

Consider the summation over $x_N$. The variable $x_N$ only appears in a single factor $\psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N})$. Thus, all the other factors in the product of the $\psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})$'s are common factors relative to this sum; so, we can pull them outside the summation over $x_N$. Consider next the summation over $x_{N-1}$. The variable $x_{N-1}$ appears in two terms, the summation mentioned above, through $\psi_{\{x_{N-1},x_{N}\}}(x_{N-1},x_{N})$, and also the factor $\psi_{\{x_{N-2},x_{N-1}\}}(x_{N-2},x_{N-1})$. All other factors in the product of the $\psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})$'s are common factors relative to the sum over $x_{N-1}$; so, we can pull them outside the summation over $x_{N-1}$. This story continues with $x_{N-2}$, $x_{N-3}$, $x_{N-4}$... until one reaches $x_n$ which is not summed over. Then the entire process starts again, but with $x_1$ and from the other side.

Consider the summation over $x_1$. The variable $x_1$ only appears in a single factor $\psi_{\{x_{1},x_{2}\}}(x_{1},x_{2})$. So, all other factors in the product of the $\psi_{\{x_i,x_{i+1}\}}(x_i,x_{i+1})$'s are common factors relative to this sum; thus, we can pull them outside the summation over $x_1$. Considering the summations over $x_2$,  $x_3$,  $x_4$... we see that we can play the same game. So, using all of these facts together, we see that we can write the marginal distribution over $p(x_n)$ as follows

This is simpler, but still somewhat of a mess. We can clean it up using the following definitions

These allow us to write $p(x_n)$ in the simple form

This is much better.

Beyond allowing us to express $p(x_n)$ in a compact form, the factors $\mu_\alpha(x_n)$ and $\mu_\beta(x_n)$ allow us to interpret the calculation of $p(x_n)$ in a very fruitful manner. Staring at the definition of $\mu_\alpha(x_n)$, one sees that it can be written recursively

This suggests that we can interpret $\mu_\alpha(x_i)$ as a message being passed from lower numbered nodes to higher numbered nodes, where at each node the equation above is applied to obtain the subsequent $\mu_\alpha(x_i)$. The initial message in this chain is then given by

Similarly, $\mu_\beta(x_n)$ also admits a recursive definition

This suggests that we can interpret $\mu_\beta(x_i)$ as a message being passed but from higher numbered nodes to lower numbered nodes, where at each node the above equation is applied to obtain the subsequent $\mu_\beta(x_i)$. In this chain the initial message is given by

So, in summary, the recursive message passing of $\mu_\alpha(x_i)$ and $\mu_\beta(x_i)$ along this chain can be depicted as in Figure 16.

Figure 16

So we have found that, for the simple case of the undirected graph of Figure 15, we can now marginalize to obtain the distribution $p(x_n)$ for any node $x_n$ in the graph. Next we will see how to extend this result to an arbitrary tree, i.e. a graph in which there is one and only one path between any pair of nodes. However, before we are able to do so, we must take a detour and examine so called "factor graphs."

Factor Graphs

Both the directed graphs of Bayesian networks and the undirected graphs of Markov random fields are used to express joint probabilities as a products over various factors. Generically, both types of graphs yield a joint distribution of the general form

where $S$ is a set of subsets of $\{x_1,\ldots,x_N\}.$ A "factor graph" is simply another type of graph which expresses such a factorization.

Explicitly, a factor graph is a graph with a node, depicted using a circle,  for every variable in the distribution and with a node, depicted using a square, for every factor $f_s(X_s)$ in the distribution. It also contains undirected links which link each factor node to the variable nodes on which the factor depends. So, for example, the factor node corresponding to $f_s(X_s)$ will have undirected links to all the variable nodes corresponding to variables in the set $X_s.$

As an example, consider the joint distribution

The factor graph corresponding to this joint distribution is depicted in Figure 17.

Figure 17

Figure 17 has a circular node for each of the variables $x_1,$ $x_2,$ and $x_3,$ and it has a square node for each of the factors $f_a,$  $f_b,$  $f_c,$  and $f_d.$ In addition, it has undirected edges connecting each factor node to the variable nodes on which the factor depends, e.g. the factor node of $f_a(x_1,x_2)$ has undirected edges to the variable nodes for $x_1$ and $x_2.$

Given the undirected graph of a Markov random field, we can produce a factor graph that represents the same distribution. One first introduces a variable node for each  variable node of the undirected graph, then for each clique of the undirected graph one introduces a factor node. Finally, one connects the factor nodes to the variable nodes in the factor node's clique. The factors $f_{X_s}(X_s)$ then are simply set equal to the corresponding clique factors. For example, using these rules the undirected graph of Figure 18a is converted to the factor graph of Figure 18b. However, one should also note that factor graphs are more "flexible" than the undirected graphs of a Markov random field. For example, the undirected graph of Figure 18a could also correspond to the factor graph of Figure 18c but not using the rules described above.

Figure 18a

Figure 18b

Figure 18c

Given the directed graph of a Bayesian network, we can also produce a factor graph that represents the same distribution. First, one creates a variable node for each variable node of the directed graph, then one introduces a factor node for each of the directed graph's conditional distributions. Finally, one adds undirected links linking each factor node to the variable nodes in the factor node's conditional distribution. The factors $f_{X_s}(X_s)$ then are simply set equal to the corresponding conditional distribution. For example, using these rules the directed graph of Figure 19a results in the factor graph of Figure 19b. As in the case of undirected graphs, there may be more than one factor graph corresponding to a given directed graph. For example, the directed graph of Figure 19a also corresponds to the factor graph of Figure 19c, however not using the rules above.

Figure 19a

Figure 19b

Figure 19c

Sum Product Algorithm

Next we will examine how to employ factor graphs to derive marginal distributions from directed or undirected tree-structured graphs. We will derive the so-called "sum-product algorithm," which computes such marginal distributions.

If we are given a directed or undirected tree-structured graph describing a joint distribution $p(x_1,\ldots,x_N),$ then we can, using the rules we established in the previous section, convert the graph into a factor graph that describes the same joint distribution. Using this factor graph, we can then write this joint distribution as a product of various factors, each of which corresponds to a factor node.

Now, one can easily prove that as we stared with a tree-structured graph our factor graph is also tree-structured. Thus, each factor node $f_s$ attached to a variable node $x,$ say, roots a subtree of our original factor graph. So, if we denote the product of all factors in subtree of $f_s$ as $F_s(x,X_s)$, where $X_s$ denotes all variables other than $x$ in this subtree, then we can write the joint distribution $p(x_1,\ldots,x_N)$ as follows

where $ne(x)$ is the set of all factor nodes that are neighbors of $x$.

Now, if we wish to compute the marginal distribution $p(x)$ from the joint distribution $p(x_1,\ldots,x_N),$ then we can simply employ the sum rule of probability

However, if we use the joint distribution, as derived from our analysis of the factor graph, the marginal distribution can be written as follows

As the sum is over all $x_i$ such that $x_i\ne x,$ the sum is really over all of the various $X_s,$ as these contain all variables in the factor graph that are not $x$. So, we can interchange the sum and product, as we did to derive the marginals for Figure 15, to obtain the following form of the marginal distribution

where the sum is now taken over all variables $X_s$ in the subtree rooted at $f_s$ that are not $x$. If we define $\mu_{f_s\rightarrow x}(x)$ through

then we can write the marginal distribution $p(x)$ as follows

This allows us to view the marginal distribution $p(x)$ as resulting from the product of all messages $\mu_{f_s\rightarrow x}(x)$ that "flow" to $x$ from the factor nodes $f_s$ connected to $x$ by an undirected  edge. This situation is depicted in Figure 20.

Figure 20

Next let us consider in a bit more detail the form of $F_s(x,X_s).$ As previously mentioned, $F_s(x,X_s)$ is the product of all factors in the subtree rooted by $f_s.$ However, as this subtree is also a tree, we can use this tree structure to decompose $F_s(x,X_s)$ into its factors. In particular, as $f_s$ roots its subtree, $f_s$ is a factor of $F_s(x,X_s).$ Furthermore, as our graph is tree structured, each variable node $x_m$ attached to the factor node $f_s$ roots a subtree in our factor graph. Thus, $F_s(x,X_s)$ contains a factor $G_m(x_m,X_{s_m}),$  which is the product of all factors in the subtree rooted at $x_m,$ for each such variable node $x_m.$ (Of course there is no $G(x,X_s)$ factor as $x$ has already been accounted for.) With this, we can write $F_s(x,X_s)$ as follows

where the variable nodes, other than $x$, attached to $f_s$ are denoted by $x_1,\ldots,x_M.$ This expression for $F_s(x,X_s)$ allows us to write $\mu_{f_s\rightarrow x}(x)$ in a more informative manner

where $ne(f_s)$ is the set of all variable nodes that are neighbors of $f_s$ and we have, for the same reasons as before, exchanged the sum and product. Now, if we define $\mu_{x_m\rightarrow f_s}(x_m)$ as follows

then we can rewrite $\mu_{f_s\rightarrow x}(x)$ more informatively as

With this we can view $\mu_{f_s\rightarrow x}(x)$ as the product of various messages $\mu_{x_m\rightarrow f_s}(x_m),$ from the variable nodes $x_m$ to $f_s,$ along with the factor $f_s$ itself. Finally, all of this is "marginalized" by summing over $x_1,\ldots,x_M.$ This is depicted in Figure 21.

Figure 21

Now let us consider the factor $G_m(x_m,X_{s_m})$ in a bit more detail. By definition $G_m(x_m,X_{s_m})$ is the product of all factors in the subtree rooted by $x_m$. The subtree rooted by $x_m$ consists of "sub-subtrees" each of which is rooted by a factor node $f_l$ which is connected directly to $x_m$ by an edge. This situation is depicted in Figure 22.

Figure 22

From our previous calculations we know that each of these "sub-subtrees," produces a factor $F_l(x_m,X_{m_l}).$ Thus, $G_m(x_m,X_{s_m})$ can be expressed as follows

This allows us to express $\mu_{x_m\rightarrow f_s}(x_m)$ in terms of $F_l(x_m,X_{m_l})$ as follows

where we have, for the reasons outlined previously, exchanged the sum and product operations. Using the definition of $\mu_{f_s\rightarrow x}(x)$ the previous equation can be written as

This can be interpreted as stating that the message the variable node $x_m$ sends to the factor node $f_s$ is the product of all the messages $x_m$ receives from all other adjoining factor nodes.

With this last result in concert with our previous results, we now know how to pass messages along our factor graph, as we know both how to handle these messages at variable nodes and at factor nodes. The only remaining point to clarify is how the message passing process starts.

As the messages are all eventually passed to the root node, the node we want the marginal distribution of, the obvious place the messages have to start are at the leaves of the tree. The leaves of this tree can either be variable nodes or factor nodes. So, we have to handle both cases. If a leaf is a variable node $x,$ then the message it passes to its parent factor node $f$ is simply

This is depicted in Figure 23a.

Figure 23a

For the case of a leaf factor node $f,$ the message it passes to its parent variable node $x$ is simply

This is depicted in Figure 23b.

Figure 23b

So, in summary, to evaluate the marginal distribution $p(x)$ of a variable described by a directed or undirected tree structured graph, we first use the procedure we established previously to convert the graph into a factor graph. Then we can view the factor graph as a tree rooted on the variable node $x.$ Starting at the leaves of this tree we pass leaf messages, either $\mu_{x\rightarrow f}(x)=1$ or $\mu_{f\rightarrow x}(x)=f(x),$ from the leaves to their parents. We then use the two methods we previously derived to pass messages across factor nodes

and across variable nodes

to propagate the messages from the leaves to the root $x.$ At the root node we then employ

to determine the final marginal distribution.

Now, if we wanted to compute the marginal distribution for the variable nodes $x_i$ and $x_j,$ then we could simply use the algorithm above two times, once with $x_i$ as root and once with $x_j$ as root. However, there is a more efficient algorithm that's much simpler. One can, just as we did above, view $x$ as a root and pass messages from the leaves up to the root $x,$ this will suffice to compute the marginal $p(x).$ However, one can now pass messages in the other direction, from the root to the leaves. Upon all the leaves receiving their messages, each variable node $x_k$ will have obtained messages of the form $\mu_{f_s\rightarrow x_k}(x_k)$ from all neighboring factor nodes. This will allow us to calculate the marginal $p(x_k)$ as follows

So, with only twice the computation involved in computing $p(x),$ we can compute $p(x_k)$ for any variable $x_k$ in the factor graph!

Finally, we can also use the above construction to compute conditional distributions. Assume the variables of the factor graph are partitioned into observed variables $x_i$ and a latent variable $z$ and we want to find the distribution of $z$ conditioned on the $x_i,$ in other words $p(z\mid \hat{x}_1,\ldots,\hat{x}_M)$ where the observed value of $x_i$ is $\hat{x}_i.$ To do so, we first examine all the factors in the factor graph. If a factor $f_s(\ldots,x_i,\ldots)$ is a function of $x_i,$ then we simply multiply the factor $f_s(\ldots,x_i,\ldots)$ by $I(x_i,\hat{x}_i),$ which is $1$ if $x_i=\hat{x}_i$ and is $0$ otherwise. After updating the factors for all the observed variables $x_i,$ one simply runs the above algorithm. Instead of computing $p(z)$ it now computes $p(z\mid \hat{x}_1,\ldots,\hat{x}_M)$ up to normalization, due to the introduction of the $I(x_i,\hat{x}_i)$ factors.