
including the pertaining terminology are given in Figures 2.3 and 2.4.

      Design Example 2.3

      Decision tree induction is closely related to rule induction. Each path from the root of a decision tree to one of its leaves can be transformed into a rule simply by conjoining the tests along the path to form the antecedent part, and taking the leaf’s class prediction as the class value. For example, one of the paths in Figure 2.16 can be transformed into the rule: “If the customer’s age is under 30 and the customer is male, then the customer will respond to the mail.”
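      The path-to-rule transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's implementation: the `Node` class, the `tree_to_rules` helper, and the toy tree are hypothetical, and only the "age < 30, male, respond" path is taken from the text; the other branches are invented to make the example complete.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """A decision-tree node: leaves carry a prediction, internal nodes carry branches."""
    prediction: Optional[str] = None
    # Each branch pairs a test (as text) with the subtree reached when the test holds.
    branches: Tuple[Tuple[str, "Node"], ...] = ()


def tree_to_rules(node: Node, path: Tuple[str, ...] = ()) -> List[str]:
    """Return one rule per root-to-leaf path by conjoining the tests along the path."""
    if not node.branches:  # leaf: the collected tests form the antecedent
        antecedent = " AND ".join(path) if path else "TRUE"
        return [f"IF {antecedent} THEN {node.prediction}"]
    rules: List[str] = []
    for test, child in node.branches:
        rules.extend(tree_to_rules(child, path + (test,)))
    return rules


# Hypothetical toy tree; only the first path mirrors the rule quoted in the text.
toy_tree = Node(branches=(
    ("age < 30", Node(branches=(
        ("gender = male", Node(prediction="respond")),
        ("gender = female", Node(prediction="no response")),
    ))),
    ("age >= 30", Node(prediction="no response")),
))

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

      Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.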

Schematic illustration of decision tree presenting response to direct mailing.

      The goal was to predict whether an email message is spam (junk email) or good.

       Input features: Relative frequencies in a message of 57 of the most commonly occurring words and punctuation marks in all the training email messages.

       For this problem, not all errors are equal: filtering out good email must be avoided, whereas letting spam get through is undesirable but less serious in its consequences.

       Spam is coded as 1 and good email as 0.

       A system like this would be trained for each user separately (e.g. their word lists would be different).

       Forty‐eight quantitative predictors – the percentage of words in the email that match a given word. Examples include business, address, Internet, free, and George. The idea was that these could be customized for individual users.

       Six quantitative predictors – the percentage of characters in the email that match a given character. The characters are ch;, ch(, ch[, ch!, ch$, and ch#.

       The average length of uninterrupted sequences of capital letters: CAPAVE.

       The length of the longest uninterrupted sequence of capital letters: CAPMAX.

       The sum of the length of uninterrupted sequences of capital letters: CAPTOT.

       A test set of size 1536 was randomly chosen, leaving 3065 observations in the training set.

       A full tree was grown on the training set, with splitting continuing until a minimum bucket size of 5 was reached.
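      To make the predictor definitions above concrete, the following sketch computes the three capital-letter features and two of the percentage features for a single message. It is only an illustration under stated assumptions: the function names, the tokenization rule (lowercased alphanumeric runs), and the sample message are hypothetical, and the original data set may have used different conventions.

```python
import re
from typing import Dict


def capital_run_features(text: str) -> Dict[str, float]:
    """CAPAVE, CAPMAX and CAPTOT from uninterrupted runs of capital letters."""
    runs = [len(run) for run in re.findall(r"[A-Z]+", text)]
    if not runs:  # assumption: a message without capitals gets zeros
        return {"CAPAVE": 0.0, "CAPMAX": 0.0, "CAPTOT": 0.0}
    return {
        "CAPAVE": sum(runs) / len(runs),  # average run length
        "CAPMAX": float(max(runs)),       # longest run
        "CAPTOT": float(sum(runs)),       # total length of all runs
    }


def word_percentage(text: str, word: str) -> float:
    """Percentage of words in the message that match the given word."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())  # assumed tokenization
    if not tokens:
        return 0.0
    return 100.0 * tokens.count(word.lower()) / len(tokens)


def char_percentage(text: str, char: str) -> float:
    """Percentage of characters in the message that match the given character."""
    if not text:
        return 0.0
    return 100.0 * text.count(char) / len(text)


# Hypothetical message, not taken from the data set.
message = "FREE offer!!! Click NOW to win $$$"
features = capital_run_features(message)
features["word_free"] = word_percentage(message, "free")
features["char_!"] = char_percentage(message, "!")
features["char_$"] = char_percentage(message, "$")
print(features)
```

      Repeating the two percentage helpers over all 48 words and 6 characters, and adding the three capital-run features, would give the 57 inputs described in the text.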

      Design Example 2.4

Schematic illustration of predicting email spam.

      Source: Trevor Hastie [12].

Schematic illustration of top-down algorithmic framework for decision tree induction.

      Impurity-based criteria: Given a random variable x with k discrete values, distributed according to P = (p₁, p₂, …, pₖ), an impurity measure is a function ϕ: [0, 1]ᵏ → ℝ that satisfies the following conditions: ϕ(P) ≥ 0; ϕ(P) is minimum if ∃i such that component pᵢ = 1; ϕ(P) is maximum if ∀i, 1 ≤ i ≤ k, pᵢ = 1/k; ϕ(P) is symmetric with respect to the components of P; ϕ(P) is differentiable everywhere in its range. If the probability vector has a component equal to 1 (the variable x takes only one value), the variable is said to be pure. On the other hand, if all components are equal, the level of impurity reaches its maximum. Given a training set S, the probability vector of the target attribute y is defined as

      $$P_y(S) = \left( \frac{\left|\sigma_{y=c_1} S\right|}{|S|}, \ldots, \frac{\left|\sigma_{y=c_{|\mathrm{dom}(y)|}} S\right|}{|S|} \right) \tag{2.12}$$
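      As a rough illustration of Eq. (2.12) and of the impurity conditions listed above, the sketch below computes the class probability vector of a small labeled sample and evaluates two impurity functions on it. Several things here are assumptions rather than part of the text: σ with subscript y = c is read as selecting the examples of S whose target value is c; the Gini index and entropy are swapped in as two widely used impurity functions, since the text has not named a specific ϕ at this point; and the function names and sample labels are hypothetical.

```python
import math
from collections import Counter
from typing import Hashable, List, Sequence


def class_distribution(labels: Sequence[Hashable],
                       classes: Sequence[Hashable]) -> List[float]:
    """P_y(S): the fraction of examples in S carrying each class value."""
    if not labels:
        raise ValueError("S must contain at least one example")
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def gini(p: Sequence[float]) -> float:
    """Gini index, one common impurity function (assumed here for illustration)."""
    return 1.0 - sum(pi * pi for pi in p)


def entropy(p: Sequence[float]) -> float:
    """Shannon entropy in bits, another common choice; 0 * log(0) is treated as 0."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)


# Hypothetical sample: 3 spam messages (coded 1) and 5 good messages (coded 0).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
p = class_distribution(labels, classes=(0, 1))
print(p, gini(p), entropy(p))

# Spot checks of the listed conditions for the binary case.
print(gini([1.0, 0.0]))  # pure vector: minimum impurity
print(gini([0.5, 0.5]))  # uniform vector: maximum impurity
```

      Both functions are zero on a pure vector, largest on the uniform vector, and unchanged when the components of P are permuted.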

      The goodness of split due to the discrete attribute aᵢ is defined as the reduction in impurity of the target attribute after partitioning S according to the values vᵢ,ⱼ ∈ dom(aᵢ):

      $$\Delta\Phi(a_i, S) = \phi\left(P_y(S)\right) - \sum_{j=1}^{|\mathrm{dom}(a_i)|} \frac{\left|\sigma_{a_i=v_{i,j}} S\right|}{|S|} \cdot \phi\left(P_y\left(\sigma_{a_i=v_{i,j}} S\right)\right). \tag{2.13}$$
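      The goodness-of-split criterion of Eq. (2.13) can be sketched as follows. This is a minimal example under stated assumptions: the Gini index stands in for the generic impurity function ϕ, the helper names are hypothetical, and the two toy attributes are invented to show a perfectly informative split and an uninformative one.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, Hashable, List, Sequence


def gini(p: Sequence[float]) -> float:
    """Gini index, used here as one possible impurity function (an assumption)."""
    return 1.0 - sum(pi * pi for pi in p)


def class_distribution(labels: Sequence[Hashable],
                       classes: Sequence[Hashable]) -> List[float]:
    """P_y(S): the fraction of examples in S carrying each class value."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


def impurity_reduction(values: Sequence[Hashable],
                       labels: Sequence[Hashable],
                       phi: Callable[[Sequence[float]], float] = gini) -> float:
    """Impurity of S minus the size-weighted impurity of its partitions."""
    classes = list(dict.fromkeys(labels))  # distinct class values, in order seen
    n = len(labels)
    # Partition the labels of S by the attribute value of each example.
    partitions: Dict[Hashable, List[Hashable]] = defaultdict(list)
    for value, label in zip(values, labels):
        partitions[value].append(label)
    # Values of dom(a_i) that never occur in S have weight 0, so summing
    # over the observed values gives the same result as summing over dom(a_i).
    weighted = sum(
        len(part) / n * phi(class_distribution(part, classes))
        for part in partitions.values()
    )
    return phi(class_distribution(labels, classes)) - weighted


# Hypothetical attribute that separates the classes perfectly.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
contains_free = ["yes", "yes", "yes", "no", "no", "no", "no", "no"]
perfect = impurity_reduction(contains_free, labels)

# Hypothetical attribute that carries no information about the class.
uninformative = impurity_reduction(["a", "b", "a", "b"], [1, 1, 0, 0])
print(perfect, uninformative)
```

      A tree inducer following this criterion would evaluate the reduction for every candidate attribute and split on the one with the largest value.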
