Preprocessing

Preprocessing is the task of applying a set of transformations to the data items in a data set. Preprocessing functionality is available through the preprocessing tool, which can be launched from the right-click context menu in the data matrix.

The information used to perform a transformation is called a preprocessor description. Using the preprocessing tool, one can define preprocessors as well as apply them. The tool supports loading and saving preprocessor descriptions, which is useful when different data sets require the same transformations to be usable in HUGIN.

The preprocessing tool is displayed in Figure 1.

Figure 1: The preprocessing tool - a number of preprocessor descriptions have been loaded
The top half of the window contains a list of preprocessors. When a preprocessor is selected, the description can be edited in the details text field located in the bottom half of the window.

Use the function buttons '+' and '-' to create and delete preprocessors, and the load and save buttons to read and store preprocessor descriptions in files. When clicking the save button, one has the option to save all preprocessors or only the selected ones, see Figure 2.

Figure 2: Save all or selected preprocessors
Use the preview button before applying a preprocessor, to see how it performs. Click the run button to apply a preprocessor.

Preprocessor Descriptions

A preprocessor description is a list of text lines containing three pieces of information: the preprocessor type, a column specifier, and any parameters for the preprocessor:
<preprocessor type>
<column specifier>
<argument>*
The preprocessor type can be one of:
REPLACE
REPLACE_REGEX
TO_UPPER
TO_LOWER
DISCRETIZE_MANUAL
DISCRETIZE_EQUAL_DISTRIBUTION
The column specifier can select a single column based on the column name:
NAME <column name>
Or a set of columns whose names match a regular expression (regular expressions are described further below):
SELECT <regular expression>
The arguments depend on the chosen preprocessor type and are described for each preprocessor type in the sections further below.
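As an illustration of the three-part layout described above, the following is a minimal Python sketch that splits a description into its type, column specifier, and arguments. It is not part of HUGIN; the names `Description` and `parse_description` and the column name `Smoker` are hypothetical, and the handling of blank lines and surrounding whitespace is an assumption.

```python
from typing import List, NamedTuple

KNOWN_TYPES = {
    "REPLACE", "REPLACE_REGEX", "TO_UPPER", "TO_LOWER",
    "DISCRETIZE_MANUAL", "DISCRETIZE_EQUAL_DISTRIBUTION",
}


class Description(NamedTuple):
    kind: str             # preprocessor type, e.g. "REPLACE"
    selector: str         # "NAME" or "SELECT"
    target: str           # column name or regular expression
    arguments: List[str]  # remaining lines, one argument per line


def parse_description(text: str) -> Description:
    """Split a description into type, column specifier, and arguments."""
    lines = text.splitlines()
    if len(lines) < 2:
        raise ValueError("expected a type line and a column specifier line")
    kind = lines[0].strip()
    if kind not in KNOWN_TYPES:
        raise ValueError(f"unknown preprocessor type: {kind!r}")
    # The specifier keyword and its operand are separated by the first space.
    selector, _, target = lines[1].strip().partition(" ")
    if selector not in ("NAME", "SELECT") or not target:
        raise ValueError(f"bad column specifier: {lines[1]!r}")
    return Description(kind, selector, target, lines[2:])


# Hypothetical description: replace 'value' with 'val' in column 'Smoker'.
d = parse_description("REPLACE\nNAME Smoker\nvalue\nval")
print(d)
```

Keeping the arguments as an uninterpreted list reflects that their meaning depends on the preprocessor type.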

Regular expressions

A regular expression is a pattern used to match a sequence of characters. The pattern matching rules used in HUGIN follow those of Java regular expressions, as defined by java.util.regex.Pattern.

A summary of selected regular-expression constructs:

Characters
x	The character x
\\ 	The backslash character

Character classes
[abc]		a, b, or c (simple class)
[^abc]		Any character except a, b, or c (negation)
[a-zA-Z]	a through z or A through Z, inclusive (range)
[a-d[m-p]]	a through d, or m through p: [a-dm-p] (union)

Predefined character classes
. 	Any character
\d 	A digit: [0-9]
\D 	A non-digit: [^0-9]
\s 	A whitespace character: [ \t\n\x0B\f\r]
\S 	A non-whitespace character: [^\s]
\w 	A word character: [a-zA-Z_0-9]
\W 	A non-word character: [^\w]

Greedy quantifiers
X? 	X, once or not at all
X* 	X, zero or more times
X+ 	X, one or more times

Logical operators
XY 	X followed by Y
X|Y 	Either X or Y
(X) 	X, as a capturing group
The backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.

Examples of regular expressions:

[n].*		Match any string that begins with the character 'n', e.g. 'not', 'nothing' etc.
\d+(\.\d*)?	Match any string that is a number, of the form 1 or 1.234 etc.
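The two example expressions can be tried out with a short Python sketch. Python's `re` module is used here only as a stand-in for java.util.regex; for these simple constructs the two behave the same, provided `re.ASCII` is set so that `\d` means [0-9]. Whole-string matching via `fullmatch` is an assumption about how HUGIN applies a pattern to a data item, and the helper name `matches` is hypothetical.

```python
import re

# re.ASCII restricts \d to [0-9], as in the summary table above.
starts_with_n = re.compile(r"[n].*", re.ASCII)
number = re.compile(r"\d+(\.\d*)?", re.ASCII)


def matches(pattern, item):
    """Return True if the whole item matches the pattern."""
    return pattern.fullmatch(item) is not None


for item in ["not", "nothing", "Not"]:
    print(item, matches(starts_with_n, item))

for item in ["1", "1.234", "abc", ".5"]:
    print(item, matches(number, item))
```

Note that `[n].*` is case sensitive, so 'Not' does not match, and that `\d+(\.\d*)?` requires at least one digit before the decimal point.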

Creating a Preprocessor

Click the '+' button to create a new preprocessor; the dialog in Figure 3 appears.

Figure 3: Select a preprocessor type
The guided approach consists of a sequence of questions and finally automatic generation of the resulting preprocessor description. After choosing a preprocessor type, one must select a target column in the data set. See Figure 4.
Figure 4: Select target column for the new preprocessor

Preprocessor: Replace

This preprocessor replaces any data items that match a specific text string with a replacement text string. The REPLACE preprocessor takes two parameters: the text string to match and the replacement text string.

REPLACE
NAME <column-name>
<match text string>
<replacement text string>
When creating the preprocessor using the guided approach, specify parameters in the dialogs in Figures 5 and 6.
Figure 5: Enter the text string that should be replaced - any data items that match the string 'value' will be replaced
Figure 6: Enter the replacement text string - replace any data items that match 'value' with 'val'
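The effect of the REPLACE preprocessor from Figures 5 and 6 can be sketched in Python as follows. This is an illustration, not HUGIN's implementation: the function name `replace_items` is hypothetical, and it assumes that a data item matches only when it is equal to the whole match string.

```python
def replace_items(column, match_text, replacement):
    """Replace every data item equal to match_text; leave the rest unchanged."""
    return [replacement if item == match_text else item for item in column]


column = ["value", "val", "other", "value"]
result = replace_items(column, "value", "val")
print(result)
```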

Preprocessor: Regular expression replace

This preprocessor is similar to the normal REPLACE preprocessor, except that data items are matched using a regular expression instead of a fixed text string. The REPLACE_REGEX preprocessor takes two parameters, the regular expression used to select matching data items and the replacement text string.

REPLACE_REGEX
NAME <column-name>
<regular expression>
<replacement text string>
When creating the preprocessor using the guided approach, specify parameters in the dialogs in Figures 7 and 8.
Figure 7: Enter a regular expression that matches the desired data items - this regular expression matches any data item that begins with a lower- or uppercase n
Figure 8: Enter the replacement text string - replace any data items that match the regular expression with 'no'
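A Python sketch of the behaviour described for Figures 7 and 8 follows. The pattern `[nN].*` is an assumption about what the figure shows (it is one expression that matches items beginning with a lower- or uppercase n), the function name `replace_regex` is hypothetical, and whole-item matching is assumed. Python's `re` stands in for java.util.regex here.

```python
import re


def replace_regex(column, pattern, replacement):
    """Replace every data item that matches the pattern as a whole."""
    compiled = re.compile(pattern)
    return [replacement if compiled.fullmatch(item) else item for item in column]


column = ["n", "No", "never", "yes", "Y"]
result = replace_regex(column, r"[nN].*", "no")
print(result)
```

A preprocessor of this kind is a convenient way to collapse several spellings of the same answer into a single state name.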

Preprocessor: Upper case

This is a very simple preprocessor, which converts any lower case characters to upper case. The TO_UPPER preprocessor has no parameters.

TO_UPPER
NAME <column-name>

Preprocessor: Lower case

This is a very simple preprocessor, which converts any upper case characters to lower case. The TO_LOWER preprocessor has no parameters.

TO_LOWER
NAME <column-name>
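The case conversions become more useful together with the SELECT column specifier, which targets several columns at once. The Python sketch below applies a lower case conversion to every column whose name matches a regular expression. The table layout (a dict of column name to data items), the column names, and the function name `to_lower_selected` are hypothetical, and matching the whole column name is an assumption.

```python
import re


def to_lower_selected(table, column_regex):
    """Lower-case the items of every column whose name matches column_regex."""
    compiled = re.compile(column_regex)
    return {
        name: [item.lower() for item in items] if compiled.fullmatch(name) else list(items)
        for name, items in table.items()
    }


table = {"Answer1": ["YES", "No"], "Answer2": ["Maybe"], "Id": ["A7"]}
result = to_lower_selected(table, r"Answer\d")
print(result)
```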

Preprocessor: Manual Discretization

This preprocessor applies a discretization to all data items. The DISCRETIZE_MANUAL preprocessor takes a variable number of parameters, namely target intervals specified as a list of interval boundaries:

DISCRETIZE_MANUAL
NAME <column-name>
<lower bound first interval>
<upper bound previous interval/lower bound next interval>*
<upper bound last interval>
When creating the preprocessor using the guided approach, the discretization tool is spawned to aid specifying the intervals.
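To illustrate how a list of interval boundaries defines the target intervals, here is a minimal Python sketch. It is not HUGIN's implementation. Several details are assumptions that the description above leaves open: intervals are treated as closed at the lower bound and open at the upper bound (except the last, which includes its upper bound), items outside all intervals map to `None`, and the label format `lower-upper` and the function name `discretize_manual` are hypothetical.

```python
from bisect import bisect_right


def discretize_manual(column, bounds):
    """Map each numeric data item to the interval that contains it."""
    if len(bounds) < 2 or any(a >= b for a, b in zip(bounds, bounds[1:])):
        raise ValueError("bounds must be at least two strictly increasing numbers")
    labels = []
    for item in column:
        x = float(item)
        if x < bounds[0] or x > bounds[-1]:
            labels.append(None)  # outside every interval (assumed behaviour)
            continue
        # Index of the interval; clamp so the last upper bound is included.
        i = min(bisect_right(bounds, x) - 1, len(bounds) - 2)
        labels.append(f"{bounds[i]}-{bounds[i + 1]}")
    return labels


# Boundaries 0, 10, 20 define two intervals: 0-10 and 10-20.
result = discretize_manual(["0", "4.5", "10", "25"], [0, 10, 20])
print(result)
```

With n boundary lines in the description, this yields n - 1 intervals, since each inner boundary closes one interval and opens the next.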

Preprocessor: Equal Distribution Discretization

This preprocessor applies a discretization to all data items. The target intervals are dynamically generated based on all the numeric data items in the column, such that each interval contains approximately the same number of data items. The DISCRETIZE_EQUAL_DISTRIBUTION preprocessor takes a single parameter, the number of target intervals.

DISCRETIZE_EQUAL_DISTRIBUTION
NAME <column-name>
<number of states>
When creating the preprocessor using the guided approach, specify parameters in the dialog in Figure 9.
Figure 9: Specify the number of intervals
Depending on how well the data values are scattered, the number of intervals may be pruned in order to make the resulting intervals equally distributed.
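The idea of equally populated intervals, and why their number may end up smaller than requested, can be sketched in Python. The exact algorithm used by HUGIN is not described here, so this is only one possible approach under stated assumptions: boundaries are taken at evenly spaced positions in the sorted data, and duplicate boundaries are dropped. The function name `equal_distribution_bounds` is hypothetical.

```python
def equal_distribution_bounds(values, intervals):
    """Return interval boundaries so each interval holds about the same count."""
    if intervals < 1 or not values:
        raise ValueError("need at least one interval and one value")
    ordered = sorted(values)
    n = len(ordered)
    bounds = [ordered[0]]
    for k in range(1, intervals):
        candidate = ordered[(k * n) // intervals]
        # Skip boundaries that would create an empty interval (pruning).
        if candidate > bounds[-1]:
            bounds.append(candidate)
    if ordered[-1] > bounds[-1]:
        bounds.append(ordered[-1])
    return bounds


even = equal_distribution_bounds([1, 2, 3, 4, 5, 6, 7, 8], 4)
skewed = equal_distribution_bounds([1, 1, 1, 1, 1, 1, 2, 3], 4)
print(even)
print(skewed)
```

For the evenly spread data, four intervals with two items each are produced. For the skewed data, most items share the value 1, so only two intervals remain even though four were requested.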

Preview and Run Preprocessor

Before running a preprocessor, one should make sure that it performs as intended. This is done by selecting the desired preprocessor in the list (see Figure 10), and then clicking the preview button.

Figure 10: Selecting a preprocessor - the selected preprocessor is of type DISCRETIZE_MANUAL

The preview reports any errors or partial successes, as well as the transformations performed by the selected preprocessor. Use the preview functionality when writing and debugging a preprocessor description. The preview window can be seen in Figure 11.

Figure 11: Preview window - see how selected preprocessors perform, inspect errors etc.

To apply a preprocessor, select the preprocessor and click the run button. The preprocessor is applied to the data set, and a window appears with a summary of which preprocessors succeeded and which failed, see Figure 12. If any errors appear, use the preview functionality for further investigation and debugging.

Figure 12: Summary of run preprocessors
