Preprocessing is the task of applying a set of transformations to the data items in a data set. Preprocessing functionallity is available using the preprocessing tool, which can be launched from the right-click context menu in the data matrix.
The information used to perform a transformation is called a preprocessor description. Using the preprocessing tool, one can define preprocessors as well as applying them. The tool supports loading and saving preprocessor descriptions, which is useful for dealing with different data sets that require the same transformations to be usable in HUGIN.
The preprocessing tool is displayed in Figure 1.
Figure 1: The preprocessor tool - a number of preprocessor preprocessor descriptions has been loaded |
Use the function buttons '+' and '-' to create and delete preprocessors, the load and save buttons to store preprocessor descriptions in files. One has the option to save all preprocessors or only selected when clicking the save button, see Figure 2.
Figure 2: Save all or selected preprocessors |
<preprocessor type> <column specifier> <argument>*The preprocessor type can be one of:
REPLACE REPLACE_REGEX TO_UPPER TO_LOWER DISCRETIZE_MANUAL DISCRETIZE_EQUAL_DISTRIBUTIONThe column specifier can select a single column based on the column name:
NAME <column name>Or a set of columns where the column names matches a regular expression (regular expressions described further below):
SELECT <regular expression>The arguments depend on the chosen preprocessor type:
<match text string> <replacement text string>
<regular expression> <replacement text string>
<lower bound first interval> <upper bound previous interval/lower bound next interval>* <upper bound last interval>
<number of states>
A regular expression is a pattern used to match a sequence of characters. The pattern matching rules used in HUGIN follow the java regular expressions pattern matching from java.util.regex.Pattern.
A summary of selected regular-expression constructs:
Characters x The character x \\ The backslash character Character classes [abc] a, b, or c (simple class) [^abc] Any character except a, b, or c (negation) [a-zA-Z] a through z or A through Z, inclusive (range) [a-d[m-p]] a through d, or m through p: [a-dm-p] (union) Predefined character classes . Any character \d A digit: [0-9] \D A non-digit: [^0-9] \s A whitespace character: [ \t\n\x0B\f\r] \S A non-whitespace character: [^\s] \w A word character: [a-zA-Z_0-9] \W A non-word character: [^\w] Greedy quantifiers X? X, once or not at all X* X, zero or more times X+ X, one or more times Logical operators XY X followed by Y X|Y Either X or Y (X) X, as a capturing groupThe backslash character ('\') serves to introduce escaped constructs, as defined in the table above, as well as to quote characters that otherwise would be interpreted as unescaped constructs. Thus the expression \\ matches a single backslash and \{ matches a left brace.
Examples of regular expressions:
[n].* Match any string that begins with the character 'n', e.g. 'not', 'nothing' etc. \d+(\.\d*)? Match any string that is a number, on the form 1 or 1.234 etc.
Click the '+' button to create a new preprocessor, the dialog in Figure 3 appears.
Figure 3: Select a preprocessor type |
Figure 4: Select target column for the new preprocessor |
This preprocessor replaces any data items that matches a specific text string, with a replacement text string. The REPLACE preprocessor takes two parameters, the text string to match and the replacement text string.
REPLACE NAME <column-name> <match text string> <replacement text string>When creating the preprocessor using the guided approach, specify parameters in the dialogs in Figures 5 and 6.
Figure 5: Enter the text string that should be replaced - any data items that matches the string 'value' will be replaced |
Figure 6: Enter the replacement text string - replace any data items that matches 'value' with 'val' |
This preprocessor is similar to the normal REPLACE preprocessor, except that data items are matched using a regular expression instead of a fixed text string. The REPLACE_REGEX preprocessor takes two parameters, the regular expression used to select matching data items and the replacement text string.
REPLACE_REGEX NAME <column-name> <regular expression> <replacement text string>When creating the preprocessor using the guided approach, specify parameters in the dialogs in Figures 7 and 8.
Figure 7: Enter a regular expression that matches the desired data items - this regular expression matches any data item that begin with a lower- or uppercase n |
Figure 8: Enter the replacement text string - replace any data items that matches the regular expression with 'no' |
This is a very simple preprocessor, which converts any lower case characters to upper case. The TO_UPPER preprocessor has no parameters.
TO_UPPER NAME <column-name>
This is a very simple preprocessor, which converts any upper case characters to lower case. The TO_LOWER preprocessor has no parameters.
TO_LOWER NAME <column-name>
This preprocessor applies a discretization to all data items. The DISCRETIZE_MANUAL preprocessor takes a variable number of parameters, namely target intervals specified as a list of interval boundaries:
DISCRETIZE_MANUAL NAME <column-name> <lower bound first interval> <upper bound previous interval/lower bound next interval>* <upper bound last interval>When creating the preprocessor using the guided approach, the discretization tool is spawned to aid specifying the intervals.
This preprocessor applies a discretization to all data items. The target intervals are dynamically generated based on all the numeric data items in the column, such that each interval contain approximately the same number of data items. The DISCRETIZE_EQUAL_DISTRIBUTION preprocessor takes a single parameter, the number of target intervals.
DISCRETIZE_EQUAL_DISTRIBUTION NAME <column-name> <number of states>When creating the preprocessor using the guided approach, specify parameters in the dialog in Figure 9.
Figure 9: Specify the number of intervals |
Before running a preprocessor, one must make sure that it performs as inteded. This is done by selecting the desired preprocessor in the list (see Figure 10), and then clicking the preview button.
Figure 10: Selecting a preprocessor - the selected preprocessor is of type DISCRETIZE_MANUAL |
The preview reports any errors/partial success and the transformations done by the selected preprocessor. Use the preview functionallity when writing and debugging a preprocessor description. The preview window can be seen in Figure 11.
Figure 11: Preview window - see how selected preprocessors performs, inspect errors etc. |
To apply a preprocessor, select the preprocessor and click the run button. The preprocessor is applied to the data set, and a window appears with a summary of which preprocessors succeded and which failed, see Figure 12. If any errors appear, use the preview functionallity for further investigation and debugging.
Figure 12: Summary of run preprocessors |