Troubleshooting and Diagnosis

A common use of Bayesian belief network models is to diagnose system failures in a probabilistic framework.

This approach has several advantages over conventional rule-based or decision-tree methods, since Bayes nets support uncertain evidence in a theoretically correct fashion. In addition, distributions in Bayes nets can be built to model logical functions such as AND, OR and NOT using what are known as deterministic nodes; that is, nodes whose distributions contain only zeroes and ones. Such nodes, therefore, act as logic gates.
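To make the idea of a deterministic node concrete, here is a minimal Python sketch that builds probability tables containing only zeroes and ones for AND, OR and NOT. The function name `deterministic_cpt` and the dictionary layout are assumptions made for illustration; they are not part of MSBNX.

```python
from itertools import product

def deterministic_cpt(fn, n_parents):
    """Build a table for a binary child whose value is fn(parent values).

    Returns a dict mapping each parent assignment (a tuple of 0/1 values)
    to [P(child=0), P(child=1)]. Every entry is exactly 0.0 or 1.0.
    """
    cpt = {}
    for parents in product((0, 1), repeat=n_parents):
        value = 1 if fn(*parents) else 0
        cpt[parents] = [1.0 - value, float(value)]
    return cpt

# Hypothetical logic-gate nodes over binary parents.
and_cpt = deterministic_cpt(lambda a, b: a and b, 2)
or_cpt = deterministic_cpt(lambda a, b: a or b, 2)
not_cpt = deterministic_cpt(lambda a: not a, 1)
```

Each row of such a table puts all of its probability mass on a single child state, which is why the node behaves like a logic gate during inference.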

A key question in decision theory is this: In the current evidence setting, what new evidence would most effectively lead to a clear diagnosis? This quantity is often known as the value of information, and information theory provides mathematical approaches to computing it.

Types of Decision-Theoretic Diagnosis

MSBNX supports two algorithms that use information theory to order or rank variables in a Bayes net according to their information weight or influence.

In either scenario, variables or nodes in the model play certain roles. These roles are also known as labels, and must be assigned correctly or the results cannot be interpreted.

Both methods produce an ordered list of variables ranked by a value of information score. In a typical implementation, for example, this list would determine the order in which questions are asked of a diagnostician or technician.

Diagnosis: Mutual Information

In such a model, variables are assigned one of two roles:

  1. Hypothesis Node. Also known as a hidden variable, this is typically a variable that cannot be directly observed. It is the target or purpose of the overall diagnosis.

  2. Information Node. An observable variable that influences the hypothesis node(s) in the model.

There may be other nodes in the model which are not labeled; although they influence inference in the normal way, they do not otherwise enter into the diagnostic process.

Utility-based diagnosis uses mutual information to compute the amount of weight or "lift" that evidence about the state of each information node would bring to each hypothesis variable. The resulting ranking of uncertain (undetermined) information nodes is used to expedite the diagnostic process.
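The ranking idea can be sketched in Python as follows. The sketch assumes that the joint distribution of one hypothesis node and each information node, given the current evidence, has already been obtained from inference. The node names and numbers are invented, and the exact score that MSBNX computes is not specified here.

```python
from math import log2

def mutual_information(joint):
    """I(H; X) in bits, where joint[h][x] = P(H=h, X=x)."""
    p_h = [sum(row) for row in joint]          # marginal of the hypothesis
    p_x = [sum(col) for col in zip(*joint)]    # marginal of the information node
    mi = 0.0
    for h, row in enumerate(joint):
        for x, p in enumerate(row):
            if p > 0.0:                        # zero cells contribute nothing
                mi += p * log2(p / (p_h[h] * p_x[x]))
    return mi

def rank_information_nodes(joints):
    """Return (name, score) pairs, most informative node first."""
    scores = {name: mutual_information(j) for name, j in joints.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Hypothetical joint tables for three information nodes.
joints = {
    "StrongTest": [[0.45, 0.05], [0.05, 0.45]],
    "WeakTest": [[0.30, 0.20], [0.20, 0.30]],
    "Irrelevant": [[0.25, 0.25], [0.25, 0.25]],
}
ranking = rank_information_nodes(joints)
```

A node that is independent of the hypothesis scores zero, so it falls to the bottom of the list; a node whose states track the hypothesis closely rises to the top.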

Troubleshooting: Fix-or-Repair Planning

In addition to being assigned to roles, variables in a troubleshooting model are also given one or more costs. The belief network author may consider that these costs are measured in dollars (or other monetary currency), time (in minutes or seconds) or any other unit that is consistent with the problem formulation.

Troubleshooting uses an algorithm that iterates over all reasonable repair plans to find those with the highest likelihood of success at the lowest cost. The result is a list of nodes, ordered by cost. Establishing evidence about the top-ranked (cheapest) node is guaranteed, within the limits of the model, to lead to a correct diagnosis in the fewest steps and at the lowest cost.
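As a rough illustration of plan search, the Python sketch below enumerates every repair order for a tiny model and picks the one with the lowest expected cost. It is not the MSBNX algorithm. It assumes exactly one faulty component, repair costs only (no observation costs), and a free check after each repair that reveals whether the problem is gone. The component names and numbers are invented.

```python
from itertools import permutations

def expected_cost(plan, prob, cost):
    """Expected cost of repairing components in the given order.

    A repair is only paid for if the fault was not already fixed by an
    earlier repair, so each cost is weighted by the remaining probability.
    """
    total, remaining = 0.0, 1.0
    for comp in plan:
        total += remaining * cost[comp]
        remaining -= prob[comp]
    return total

def best_plan(prob, cost):
    """Brute-force search over all repair orders (small models only)."""
    return min(permutations(prob), key=lambda plan: expected_cost(plan, prob, cost))

# Hypothetical fault probabilities (summing to 1) and repair costs.
prob = {"cable": 0.5, "driver": 0.3, "toner": 0.2}
cost = {"cable": 10.0, "driver": 2.0, "toner": 5.0}
plan = best_plan(prob, cost)
```

With these numbers the search starts with the driver, which is cheap and fairly likely to be at fault. Exhaustive enumeration grows factorially with the number of components, so it is only practical for very small examples.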

Requirements for Diagnosis

To perform mutual information diagnosis in a model:

Requirements for Troubleshooting

Cost Factors

There are three types of costs that are important in a troubleshooting model.

Each of the different possible roles of variables in troubleshooting networks may have either a cost to observe, a cost to fix, both or neither. The service cost of the model is treated as the cost to fix for the entire model as a whole. Note: The service cost is required to perform troubleshooting.

The roles of variables in a troubleshooting model and their costs are as follows:

Name                      Costs Allowed    Purpose
informational             observe          Used to define observable evidentiary variables
problem-defining          fix              Used to define primary symptoms of failure; that is, the element of the model that is the target of the diagnosis.
fixable and observable    observe and fix  Used to define observable and replaceable elements
fixable but unobservable  fix              Used to define elements that can only be replaced or repaired
unfixable                 neither          Used to define elements that can neither be fixed nor observed
other                     neither          Used to define variables that play no direct part in the diagnostic process. These may be deterministic or "modal" variables that reshape the problem in a logical fashion.
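The role and cost rules in the table above can be captured in a small lookup, sketched here in Python. The names `ALLOWED_COSTS` and `check_costs` are hypothetical helpers for checking a model description by hand; MSBNX does not expose them.

```python
# Cost kinds each role may carry, taken from the table of roles.
ALLOWED_COSTS = {
    "informational": {"observe"},
    "problem-defining": {"fix"},
    "fixable and observable": {"observe", "fix"},
    "fixable but unobservable": {"fix"},
    "unfixable": set(),
    "other": set(),
}

def check_costs(role, costs):
    """Return the cost kinds in `costs` that `role` does not allow.

    `costs` maps a cost kind ("observe" or "fix") to its value; an empty
    result means the assignment is consistent with the table.
    """
    return sorted(set(costs) - ALLOWED_COSTS[role])

# Example: an unfixable node should not be given a cost to fix.
problems = check_costs("unfixable", {"fix": 25.0})
```

The check only reports cost kinds that a role may not carry; it does not require that an allowed cost be present.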

Establishing "Problem" Nodes

The most vital part of a troubleshooting network is its problem nodes. These nodes must be declared in a particular manner: state zero (the first state declared) must be associated with the normal behavior of the component or element. All other states must form a mutually exclusive and exhaustive set of failure modes of the component. Many problem nodes have only two states: "Works" and "Doesn't Work". If a problem node has more than two states, they must correspond to clearly distinct situations. Consider a computer printer with four states: "Works", "No Paper is Output", "Printing is Very Slow", and "Printing is Garbled". In each case, observation allows the problem state to be distinguished. (Some ambiguity remains, however: consider a case where printing is both garbled and slow.)

Multiple problem nodes may be defined, but only one is actually considered during any given troubleshooting session.

Using Troubleshooting

The mechanics of troubleshooting work as follows.

  1. One of the problem nodes is set (instantiated) to one of its problem states (that is, a state other than state zero).

  2. The Troubleshooting Recommendations algorithm is run, and a ranked list of nodes is returned, each with its predicted utility.

  3. The highest (first) variable in the ranked list is the one with the lowest cost. The technician or diagnostician would then attempt to gather evidence about this variable.

  4. The evidence found about the highest ranked variable is entered as evidence into the model. Alternatively, evidence can be entered for any other uninstantiated node in the collection.

  5. Return to step 2.
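The steps above can be sketched as a loop in Python. Everything named here is a stand-in: `recommend` plays the part of the Troubleshooting Recommendations algorithm and simply ranks the remaining nodes by a fixed, invented cost, and `observe` plays the part of the technician. The sketch also assumes the session ends when no uninstantiated nodes remain, which the steps themselves do not state.

```python
def troubleshoot(problem_states, problem_state, recommend, observe):
    """Run the recommend/observe loop until nothing is left to rank."""
    if problem_states.index(problem_state) == 0:
        raise ValueError("state zero is the normal state; pick a failure state")
    evidence = {"problem": problem_state}      # step 1: set the problem node
    while True:
        ranked = recommend(evidence)           # step 2: get the ranked list
        if not ranked:                         # assumed stopping rule
            return evidence
        node, _score = ranked[0]               # step 3: take the top node
        evidence[node] = observe(node)         # step 4: enter the evidence
        # step 5: loop back to step 2

# Hypothetical stand-ins for the model and the technician.
costs = {"cable": 1.0, "paper tray": 2.0, "driver": 4.0}

def recommend(evidence):
    pending = [(n, c) for n, c in costs.items() if n not in evidence]
    return sorted(pending, key=lambda nc: nc[1])   # cheapest first

asked = []

def observe(node):
    asked.append(node)
    return "ok"

states = ["Works", "No Paper is Output", "Printing is Very Slow", "Printing is Garbled"]
final = troubleshoot(states, "Printing is Garbled", recommend, observe)
```

In a real session the ranking would be recomputed by inference after each piece of evidence, so the order of later questions can change as answers come in.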

Evaluation of Diagnostic and Troubleshooting Models

The evaluation window of MSBNX will attempt to determine the correct type of procedure to perform. The rules it uses are as follows.

If other criteria for diagnosis are not met, an error message will indicate the situation.

Additional Information

See Web Links for Microsoft Research Technical Reports about troubleshooting under uncertainty.