
last condition is based on the statistical confidence interval for proportions. Usually, the last condition is used such that T refers to a subtree whose root is the internal node t, and S denotes the portion of the training set that refers to that node. The pessimistic pruning procedure performs a top-down traversal over the internal nodes. If an internal node is pruned, all of its descendants are removed from the pruning process, which makes the pruning relatively fast.

      Error-based pruning (ebp): This is an evolution of pessimistic pruning. As in pessimistic pruning, the error rate is estimated using the upper bound of the statistical confidence interval for proportions:

      (2.25) \varepsilon_{UB}(T, S) = \varepsilon(T, S) + Z_{\alpha} \cdot \sqrt{\dfrac{\varepsilon(T, S) \cdot \left(1 - \varepsilon(T, S)\right)}{|S|}}

      where ε(T, S) denotes the misclassification rate of the tree T on the training set S, Z is the inverse of the standard normal cumulative distribution, and α is the desired significance level. Let subtree(T, t) denote the subtree rooted at the node t. Let maxchild(T, t) denote the most frequent child node of t (namely, the child reached by most of the instances in S), and let S_t denote all instances in S that reach the node t. The procedure performs a bottom-up traversal over all nodes and compares the following values:

\varepsilon_{UB}(\mathrm{subtree}(T, t), S_t), \quad \varepsilon_{UB}(\mathrm{pruned}(\mathrm{subtree}(T, t), t), S_t), \quad \varepsilon_{UB}(\mathrm{subtree}(T, \mathrm{maxchild}(T, t)), S_{\mathrm{maxchild}(T, t)}).

      Depending on which of these values is lowest, the procedure either leaves the tree as is, prunes the node t into a leaf, or replaces the node t with the subtree rooted at maxchild(T, t).
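As an illustration, the upper bound of Eq. (2.25) and the three-way ebp comparison can be sketched in Python as follows (the function names, the default significance level, and the use of the standard library's normal quantile are our own choices, not part of the original algorithm):

```python
from math import sqrt
from statistics import NormalDist

def error_upper_bound(errors, n, alpha=0.25):
    """Upper bound of the confidence interval for the error rate, Eq. (2.25).

    errors / n plays the role of eps(T, S); Z is taken as the
    (1 - alpha) quantile of the standard normal distribution.
    The default alpha = 0.25 is an illustrative choice.
    """
    eps = errors / n
    z = NormalDist().inv_cdf(1.0 - alpha)
    return eps + z * sqrt(eps * (1.0 - eps) / n)

def ebp_choice(eps_subtree, eps_pruned, eps_maxchild):
    """Pick the alternative with the lowest upper-bounded error estimate:
    keep the subtree, prune the node into a leaf, or graft the subtree
    rooted at maxchild(T, t) in place of the node."""
    options = {"keep": eps_subtree, "prune": eps_pruned, "graft": eps_maxchild}
    return min(options, key=options.get)
```

With zero observed errors the bound is zero; otherwise it exceeds the raw error rate, which is what makes the estimate pessimistic.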

      Optimal pruning (opt): Bohanec and Bratko [17] introduced an algorithm guaranteeing optimality called optimal pruning (opt). This algorithm finds the optimal pruning based on dynamic programming, with a time complexity of Θ(|l(T)|²), where |l(T)| is the number of leaves of the initial decision tree T. Almuallim [18] introduced an improvement of opt called opt-2, which also performs optimal pruning using dynamic programming. However, the time and space complexities of opt-2 are both Θ(|l(T*)| · |internal(T)|), where T* is the target (pruned) decision tree and T is the initial decision tree.
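The core dynamic-programming idea behind these algorithms can be sketched as follows: for each node, compute a table mapping each possible number of leaves to the minimum training error achievable by a pruning of that size. This is a minimal illustration under our own simplified tree representation, not the actual opt or opt-2 implementations:

```python
class Node:
    """A tree node; leaf_errors is the number of training errors incurred
    if this node is collapsed into a single leaf."""
    def __init__(self, leaf_errors, children=()):
        self.leaf_errors = leaf_errors
        self.children = list(children)

def opt_table(node):
    """Return a dict mapping k -> minimum training errors achievable by a
    pruning of the subtree rooted at `node` that has exactly k leaves."""
    # Option 1: prune here, yielding a single leaf.
    table = {1: node.leaf_errors}
    if node.children:
        # Option 2: keep the split; combine the children's tables by
        # convolving them (knapsack-style) over the number of leaves.
        combined = {0: 0}
        for child in node.children:
            child_table = opt_table(child)
            merged = {}
            for k1, e1 in combined.items():
                for k2, e2 in child_table.items():
                    k, e = k1 + k2, e1 + e2
                    if k not in merged or e < merged[k]:
                        merged[k] = e
            combined = merged
        for k, e in combined.items():
            if k not in table or e < table[k]:
                table[k] = e
    return table
```

Reading off the smallest k whose error is acceptable gives the optimally pruned size; the quadratic cost comes from the convolution over leaf counts.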

      Minimum description length (MDL) pruning: Rissanen [19], Quinlan and Rivest [20], and Mehta et al. [21] used the MDL to evaluate the generalized accuracy of a node. This method measures the size of a decision tree by means of the number of bits required to encode the tree. The MDL method prefers decision trees that can be encoded with fewer bits. Mehta et al. [21] indicate that the cost of a split at a leaf t can be estimated as

      (2.26) \mathrm{Cost}(t) = \sum_{c_i \in \mathrm{dom}(y)} \left|\sigma_{y=c_i} S_t\right| \cdot \ln\frac{|S_t|}{\left|\sigma_{y=c_i} S_t\right|} + \frac{|\mathrm{dom}(y)| - 1}{2}\ln\frac{|S_t|}{2} + \ln\frac{\pi^{|\mathrm{dom}(y)|/2}}{\Gamma\!\left(|\mathrm{dom}(y)|/2\right)}

      where |S_t| denotes the number of instances that have reached the node and σ_{y=c_i} S_t denotes the subset of those instances labeled with the class c_i.

      The splitting cost of an internal node is calculated based on the cost aggregation of its children.
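A direct transcription of Eq. (2.26) can be written as follows; for simplicity this sketch takes the class domain dom(y) to be the set of labels actually observed at the leaf, whereas the original formulation uses the full class domain:

```python
from math import lgamma, log, pi
from collections import Counter

def mdl_leaf_cost(labels):
    """Coding cost of a leaf per Eq. (2.26).

    `labels` is the sequence of class labels of the instances reaching
    the leaf. Assumption: |dom(y)| is approximated by the number of
    distinct labels observed here.
    """
    n = len(labels)                  # |S_t|
    counts = Counter(labels)         # c_i -> |sigma_{y=c_i} S_t|
    k = len(counts)                  # |dom(y)| (see assumption above)
    # First term: cost of encoding the class labels given the leaf.
    data_bits = sum(c * log(n / c) for c in counts.values())
    # Second term: penalty for the class-frequency parameters.
    penalty = (k - 1) / 2 * log(n / 2)
    # Third term: ln(pi^{k/2} / Gamma(k/2)), via lgamma for stability.
    const = (k / 2) * log(pi) - lgamma(k / 2)
    return data_bits + penalty + const
```

A pure leaf (a single class) has zero cost, since k = 1 makes the penalty vanish and ln(√π / Γ(1/2)) = 0; mixing classes raises the cost.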

      2.2.3 Dimensionality Reduction Techniques

      In this section, we provide an overview of the mathematical properties and foundations of the various dimensionality reduction techniques [22–24].

      There are several dimensionality reduction techniques specifically designed for time series. These methods exploit the frequency content of the signal and its usual sparseness in the frequency domain. The most popular methods are those based on wavelets [25, 26], with empirical mode decomposition a distant second [27, 28] (the reader is referred to the references above for further details). We do not cover these techniques here, since they are not usually applied to the general-purpose dimensionality reduction of data. From a general point of view, we may say that wavelets project the input time series onto a fixed dictionary (see Section 2.3). This dictionary has the property of making the projection sparse (only a few coefficients are sufficiently large), and the dimensionality reduction is obtained by setting most coefficients (the small ones) to zero. Empirical mode decomposition instead constructs a dictionary specially adapted to each input signal.
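The sparsification idea can be illustrated with the simplest fixed dictionary, the Haar wavelet basis. The following is a minimal pure-Python sketch; a real application would use a wavelet library and a principled threshold:

```python
from math import sqrt

def haar_dwt(signal):
    """Full Haar wavelet decomposition of a signal of length 2^k.

    Repeatedly replaces the signal by pairwise averages (scaled by
    sqrt(2) to keep the transform orthonormal), collecting the pairwise
    differences as detail coefficients. Returns [coarse approximation,
    details from coarsest to finest].
    """
    coeffs = []
    approx = list(signal)
    while len(approx) > 1:
        avg = [(approx[i] + approx[i + 1]) / sqrt(2) for i in range(0, len(approx), 2)]
        det = [(approx[i] - approx[i + 1]) / sqrt(2) for i in range(0, len(approx), 2)]
        coeffs = det + coeffs
        approx = avg
    return approx + coeffs

def sparsify(coeffs, keep):
    """Dimensionality reduction step: keep only the `keep`
    largest-magnitude coefficients and set the rest to zero."""
    order = sorted(range(len(coeffs)), key=lambda i: abs(coeffs[i]), reverse=True)
    kept = set(order[:keep])
    return [c if i in kept else 0.0 for i, c in enumerate(coeffs)]
```

For a smooth signal, most Haar detail coefficients are near zero, so zeroing all but a few coefficients preserves most of the signal's energy.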

      To maintain the consistency of this review, we do not cover those dimensionality reduction techniques that take into account the class of the observations; that is, there are observations from a class A of objects, observations from a class B, … and the dimensionality reduction technique should maintain, to the extent possible, the separability of the original classes. Fisher’s Linear Discriminant Analysis was one of the first techniques to address this issue [29, 30]. Many other works have followed since then; for the most recent works and for a bibliographical review, see [31, 35]. Next, we will focus on vector quantization, mixture models, and PCA, which was already introduced to some extent in the previous section.

      In the following, we will refer to the observations as input vectors x, whose dimension is M. We will assume that we have N observations, and we will refer to the nth observation as x_n. The whole dataset of observations will be X, whereas X will be an M × N matrix with all the observations as columns. Note that non-bold small letters represent vectors (x), whereas capital, non-bold letters (X) represent matrices.
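Under this convention, a dataset can be laid out as follows (an illustrative numpy sketch; the dimensions and random data are our own):

```python
import numpy as np

# Illustrative dimensions: M = 3 features, N = 5 observations.
rng = np.random.default_rng(0)
M, N = 3, 5

# Data matrix X is M x N, with the observations x_n as its columns.
X = np.column_stack([rng.standard_normal(M) for _ in range(N)])
x_1 = X[:, 0]  # the first observation is the first column

print(X.shape)    # (3, 5)
print(x_1.shape)  # (3,)
```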
