Скачать книгу

is approximated by the chi‐square distribution. Note that since the log likelihood for the saturated model is common to both deviance values, ΔD is calculated without actually estimating the saturated model. This fact becomes very important during subset selection. The formula for ΔD that is used for testing the significance of the regression coefficient(s) associated with the independent variable X1 is ΔDX1 = Dwithout X1Dwith X1 = −2 [Lwithout X1LSaturated] + 2[Lwith X1LSaturated] = −2[LwithoutX1LwithX1].

      Note that this formula looks identical to the likelihood ratio statistic. Because of the similarity between the change in deviance test and the likelihood ratio test, their names are often used interchangeably.

      The formula for the Wald statistic is z Subscript j Baseline equals b Subscript j Baseline slash s Subscript b Sub Subscript j, where s Subscript b Sub Subscript j is an estimate of the standard error of bj provided by the square root of the corresponding diagonal element of the covariance matrix, upper V left-parenthesis ModifyingAbove beta With ampersand c period circ semicolon right-parenthesis. With large sample sizes, the distribution of zj is closely approximated by the normal distribution. With small and moderate sample sizes, the normal approximation is described as “adequate.”

      2.2.2 Decision Tree Classifiers

      As indicated in Section 2.1, decision trees are considered to be one of the most popular approaches for representing classifiers. Here, we present a number of methods for constructing decision tree classifiers in a top‐down manner. This section suggests a unified algorithmic framework for presenting these algorithms and describes the various splitting criteria and pruning methodologies. The material is discussed along the lines presented in [8–11].

      Supervised methods are methods that attempt to discover relationship between the input attributes and the target attribute. The relationship discovered is represented in a structure referred to as a model. Usually, models can be used for predicting the value of the target attribute knowing the values of the input attributes. It is useful to distinguish between two main supervised models: classification models (classifiers) and regression models. Regression models map the input space into a real‐valued domain, whereas classifiers map the input space into predefined classes. For instance, classifiers can be used to classify students into two groups: those passing exams on time and those passing exams with a delay. Many approaches are used to represent classifiers. The decision tree is probably the most widely used approach for this purpose.

      Attributes can be nominal or numeric. When the attribute ai is nominal, it is useful to denote by dom left-parenthesis a Subscript i Baseline right-parenthesis equals StartSet v Subscript i comma 1 Baseline comma v Subscript i comma 2 Baseline comma ellipsis comma v Subscript i comma bar dom left-parenthesis a Sub Subscript i Subscript right-parenthesis bar Baseline EndSet its domain values, where ∣ dom(ai)∣ stands for its finite cardinality. In a similar way, dom(y) = {c1, …, c∣ dom(y)∣} represents the domain of the target attribute. Numeric attributes have infinite cardinalities. The set of all possible examples is called the instance space, which is defined as the Cartesian product of all the input attributes domains: X = dom(a1) × dom(a2) × … × dom(an). The universal instance space (or the labeled instance space) U is defined as the Cartesian product of all input attribute domains and the target attribute domain, that is, U = X × dom(y). The training set is a bag instance consisting of a set of m tuples (also known as records). Each tuple is described by a vector of attribute values in accordance with the definition of the bag schema. Formally, the training set is denoted as

upper S left-parenthesis upper R right-parenthesis equals left-parenthesis left pointing angle x 1 comma y 1 right pointing angle comma ellipsis comma left pointing angle x Subscript normal m Baseline comma y Subscript m Baseline right pointing angle right-parenthesis where x Subscript q Baseline element-of upper X and y Subscript q Baseline element-of dom left-parenthesis y right-parenthesis period

      Usually, it is assumed that the training set tuples are generated randomly and independently according to some fixed and unknown joint probability distribution D over U. Note that this is a generalization of the deterministic case when a supervisor classifies a tuple using a function y = f(x). Here‚ we use the common notation of bag algebra to present the projection (π) and selection (σ) of tuples.

      The ML community, which is target audience of this book, has introduced the problem of concept learning. To learn a concept is to infer its general definition from a set of examples. This definition may be either explicitly formulated or left implicit, but either way it assigns each possible example to the concept or not. Thus, a concept can be formally regarded as a function from the set of all possible examples to the Boolean set {true, false}.

      On the other hand, the data mining community prefers to deal with a straightforward extension of the concept learning, known as the classification problem. In this case, we search for a function that maps the set of all possible examples into a predefined set of class labels and is not limited to the Boolean set.

      An inducer is an entity that obtains a training set and forms a classifier that represents the generalized relationship between the input attributes and the target attribute.

      The notation I represents an inducer and I(S) represents a classifier that was induced by performing I on a training set S. Most frequently, the goal of the classifier’s inducers is formally defined as follows: Given a training set S with input attributes set A = {a1, a2, … an} and a target attribute y from a unknown fixed distribution D over the labeled instance space, the goal is to induce an optimal classifier with minimum generalization error.

      Generalization error is defined as the misclassification rate over the distribution D. In case of the nominal attributes, it can be expressed as

      (2.11)sigma-summation Underscript left pointing angle x comma y right pointing angle element-of upper U Endscripts upper D left-parenthesis x comma y right-parenthesis dot upper L left-parenthesis y comma upper I left-parenthesis upper S right-parenthesis left-parenthesis x right-parenthesis right-parenthesis

      where L(y, I(S)(x) is the loss function defined as L(y, I(S)(x)) = 0, if y = I(S)(x) and 1, if yI(S)(x). In the case of numeric attributes, the sum operator is replaced with the appropriate integral operator.

Скачать книгу