## NPC Algorithm

Two constraint-based algorithms are available for structure learning: The PC algorithm and the NPC algorithm.

NPC stands for Necessary Path Condition, and it is a criterion developed by researchers at Siemens in Munich for solving some of the problems of constraint-based learning algorithms like the PC algorithm. The basic machinery is the same in the PC and the NPC algorithms (i.e., they are both based on generating a skeleton derived through statistical tests for conditional independence).

The NPC algorithm seeks to repair the deficiencies of the PC algorithm, which occur especially in the face of limited data sets. The solution provided by the NPC algorithm is based on inclusion of a criterion known as the necessary path condition. This criterion forms the basis for introducing the notion of ambiguous regions, which in turn provide a language for selecting among sets of inter-dependent uncertain links. The resolution of ambiguous regions is performed in interaction with the user.

In constraint-based learning algorithms, the skeleton of the graph is constructed by not including a link in the induced graph whenever the corresponding nodes are found to be conditional independent. There can, however, be inconsistencies among the set of conditional independence and dependence statements (CIDs) derived from limited data sets. That is, not all CIDs can be represented simultaneously. The inconsistencies are assumed to stem solely from sampling noise; i.e., we still assume that there exists a perfect independence map of the (unknown) probability distribution from which the data is generated.

The number of inconsistencies in the set of CIDs reflects structural model uncertainty. Thus, the number of uncertainties is a confidence measure for the learned structure and can as such be used as an indication of whether or not sufficient data has been used to perform the learning. The inconsistent CIDs produce multiple solutions when inducing a directed acyclic graph (DAG) from them. These solutions differ with respect to the set of links included.

To resolve the inconsistencies, the NPC algorithm relies on user interaction where the user gets the opportunity to decide on directionality of undirected links and to resolve the ambiguous regions.

### The Necessary Path Condition

Informally, the necessary path condition says that in order for two variables X and Y to be independent conditional on a set S, with no proper subset of S for which this holds, there must exist a path between X and every Z in S (not crossing Y) and between Y and every Z in S (not crossing X). Otherwise, the inclusion of Z in S is unexplained. Thus, in order for an independence statement to be valid, a number of links are required to be present in the graph.

### Ambiguous Regions

When the absence of a link, a, depends on the presence of another link, b, and vice versa, we define a and b to be inter-dependent. Both a and b constitute what we call uncertain links. An ambiguous region is a maximal set of inter-dependent links. That is, an ambiguous region consists of a set of uncertain links.

The main goal is to obtain as few and small ambiguous regions as possible. It should be noted that deterministic relations between variables will also produce ambiguous regions (see example below).

If there are some uncertain links (or links that need to be directed), the user is offered the possibility of providing information as to how the ambiguous regions should be resolved. A special, intuitive graphical interface is provided for this interaction.

### Resolving Ambiguous Regions and Undirectedness

When using the NPC algorithm, an intuitive graphical interface is provided for resolving the structural uncertainties (if any) found by the algorithm. Figure 1 shows an example of a structure with unresolved uncertainties.

 Figure 1: The structural uncertainties found by the NPC algorithm. Black (undirected) links are selectable and their directionality can be determined by the user. The links of each ambiguous region are drawn in the same color. These links are also selectable and will be removed or kept, depending on the action performed by the user.

Should you wish not to provide such information, just click the Done button, and the NPC algorithm will resolve the uncertainties. Note, however, that directionality for undirected links will be decided on a random basis, and that for those ambiguous regions with no information provided, all of the uncertain links will be removed, possibly resulting in a poor model.

Instead of randomly determining the directionality of the links of the learned structure that cannot be determined automatically from the data, the NPC algorithm gives the user the opportunity to determine the directionality of such links.

The undirected links (not belonging to ambiguous regions) are drawn in black. When selected, such a link gets highlighted and the two directionality buttons get enabled:

 or      or      or Figure 2: The different appearances of the directionality buttons.

Which of the above appearances of the pair of buttons are used depends on the relative position of the nodes of the selected link. Thus, it should be readily apparent which of the two buttons should be selected for achieving the desired directionality for the selected link. However, when positioning the mouse cursor on top of a button, a tool tip will appear after a short while, explaining precisely which direction is associated with the button in question.

Note that assigning directionality to a link may cause other links to automatically be directed. If, for example, there are undirected links between variables x and y, between y and z, and between x and z, then enforcing the directionalities x --> y and y --> z implies x --> z; otherwise, a directed cycle would occur.

#### Resolving Ambiguous Regions

An ambiguous region consists of a set of inter-dependent uncertain links: Absence of a link in an ambiguous region depends on the presence of one or more of the other links of the region, and vice versa.

Each ambiguous region is easily identified as consisting of links in the same color. (Note that there can be so many ambiguous regions that colors have to be reused, making it difficult from the coloring alone to distinguish them.) Also, when selecting a link of an ambiguous region, all links of the region will be highlighted; the one selected will be drawn bold and the other links will be drawn with double lines.

When a link of an ambiguous region has been selected, the include/exclude link buttons get enabled:

 Figure 3: The include/exclude link buttons.

When a decision has been made that a link should be present or absent, each of the other links of the ambiguous region will be affected in one of several ways:

• Unaffected.
• Disappear.
• Turned into an undirected link, not belonging to the region anymore.
• Turned into a directed link, not belonging to the region anymore.

Which of these consequence will be observed depends on the conditional independence and dependence statements (CIDs) found from the statistical tests performed by the NPC algorithm.

#### Identify Minimal Resolutions

An ambiguous regions may consist of a large number of uncertain edges. This implies that it ay be very difficult for the user to find an appropriate resolution of an ambiguous region.

Each ambiguous region will have at least one minimal resolution. A minimal resolution is a minimal set of edges, which resolves the ambiguous region.

 Figure 4: The identify minimal resolutions button.

When an uncertain link of an ambiguous region has been selected the button shown in Figure 4 appears. If this button is depressed, a window showing all minimal resolutions is displayed to the user. The user may select an minimal resolution.

### Undo

Often it is impossible (or very hard) to predict the consequences of a decision made regarding directionality of an undirected link, or regarding the presence or absence of a link of an ambiguous region. Therefore, you will probably find the undo facility quite useful.

The most recent (non-undo) action performed can be undone by clicking the undo button:

 Figure 5: The undo button.

Please note that an arbitrary number of operations performed can be undone by repeatedly clicking the undo button.

### References

The interested reader is referred to the literature for a more detailed description of the notions behind the NPC algorithm. See e.g.

• Steck, H. (2001). Constrained-Based Structural Learning in Bayesian Networks Using Finite Data Sets, PhD Thesis, Institut für der Informatik der Technischen Universität München.
• Steck, H. and Tresp, V. (1999). Bayesian Belief Networks for Data Mining, Proceedings of The 2nd Workshop on Data Mining und Data Warehousing als Grundlage Moderner Entschidungsunterstuezender Systeme, DWDW99, Sammelband, Universität Magdeburg, September 1999.
• Steck, H., Hofmann, R., and Tresp, V. (1999). Concept for the PRONEL Learning Algorithm, Siemens AG, Munich.

Back