
It is easy to see that z is a function of x through a chain-like argument: one function maps x to y, and another function maps y to z. The chain rule can be used to compute ∂z/∂x^T as

      (3.78)    ∂z/∂x^T = (∂z/∂y^T)(∂y/∂x^T).
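Equation (3.78) can be checked numerically. The sketch below uses two hypothetical maps (not from the text): y = Ax, so the Jacobian ∂y/∂x^T is A, and z = Σ y², so ∂z/∂y^T = 2y^T. It forms ∂z/∂x^T as the product in (3.78) and compares it to a finite-difference estimate:

```python
import numpy as np

# A minimal numeric check of Eq. (3.78) with hypothetical maps:
# y = A x  (so dy/dx^T = A)  and  z = sum(y**2)  (so dz/dy^T = 2 y^T).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 2))      # y = A x maps R^2 -> R^3
x = rng.standard_normal(2)
y = A @ x
z = np.sum(y ** 2)

dz_dy = 2.0 * y                      # the row vector dz/dy^T
dy_dx = A                            # the Jacobian dy/dx^T
dz_dx = dz_dy @ dy_dx                # chain rule: dz/dx^T = (dz/dy^T)(dy/dx^T)

# Finite-difference estimate of dz/dx for comparison.
eps = 1e-6
fd = np.array([(np.sum((A @ (x + eps * np.eye(2)[i])) ** 2) - z) / eps
               for i in range(2)])
print(np.allclose(dz_dx, fd, atol=1e-3))  # True
```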

      3.6.1 CNN Architecture

      A CNN usually takes an order-3 tensor as its input, for example, an image with H rows, W columns, and three channels (R, G, B color channels). Higher-order tensor inputs, however, can be handled by a CNN in a similar fashion. The input then goes through a series of processing steps. A processing step is usually called a layer, which could be a convolution layer, a pooling layer, a normalization layer, a fully connected layer, a loss layer, etc. We will introduce the details of these layers later.

      For now, let us first give an abstract description of the CNN structure. Layer-by-layer operation in a forward pass of a CNN can be formally represented as x^1 → [w^1] → x^2 → ⋯ → x^{L−1} → [w^{L−1}] → x^L → [w^L] → z. This will be referred to as the operation chain. The input is x^1, usually an image (an order-3 tensor). It undergoes the processing in the first layer, which is the first box. We denote the parameters involved in the first layer's processing collectively as a tensor w^1. The output of the first layer is x^2, which also acts as the input to the second layer's processing. This processing continues until all layers in the CNN have been processed, upon which x^L is output.

      The last layer is a loss layer. Let us suppose t is the corresponding target (ground-truth) value for the input x^1; then a cost or loss function can be used to measure the discrepancy between the CNN prediction x^L and the target t. For example, a simple loss function could be z = ‖t − x^L‖²/2, although more complex loss functions are usually used. This squared ℓ2 loss can be used in a regression problem. In a classification problem, the cross-entropy loss is often used. The ground truth in a classification problem is a categorical variable t. We first convert the categorical variable t into a C-dimensional probability vector, which we again denote by t. Now both t and x^L are probability mass functions, and the cross-entropy loss measures the distance between them. Hence, we can minimize the cross-entropy. The operation chain explicitly models the loss function as a loss layer whose processing is modeled as a box with parameters w^L. Note that some layers may not have any parameters; that is, w^i may be empty for some i. The softmax layer is one such example.
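The conversion of a categorical label into a C-dimensional vector and the cross-entropy computation can be sketched as follows. This is a minimal illustration, not the book's code: the helper names one_hot and cross_entropy are ours, and x^L is assumed to already be a probability mass function (e.g., the output of a softmax layer).

```python
import numpy as np

# Hypothetical helpers illustrating the classification loss described above.
def one_hot(t, C):
    """Convert categorical label t in {0, ..., C-1} to a C-dim probability vector."""
    v = np.zeros(C)
    v[t] = 1.0
    return v

def cross_entropy(t_vec, xL, eps=1e-12):
    """Cross-entropy between two probability mass functions t_vec and xL."""
    return -np.sum(t_vec * np.log(xL + eps))

xL = np.array([0.1, 0.7, 0.2])   # hypothetical CNN output, C = 3 categories
t = 1                            # ground-truth category
loss = cross_entropy(one_hot(t, 3), xL)
print(round(loss, 4))            # 0.3567, i.e. -log(0.7)
```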

      The forward run: If all the parameters of a CNN model, w^1, …, w^{L−1}, have been learned, then we are ready to use this model for prediction, which only involves running the CNN model forward, that is, in the direction of the arrows in the operation chain. Starting from the input x^1, we pass it through the processing of the first layer (the box with parameters w^1) and get x^2. In turn, x^2 is passed into the second layer, and so on. Finally, we obtain x^L ∈ ℝ^C, which estimates the posterior probabilities of x^1 belonging to the C categories. We can output the CNN prediction as arg max_i x_i^L.
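A toy sketch of the forward run follows. The layers, shapes, and the affine-plus-ReLU layer choice are all hypothetical; only the pattern of chaining x^1 through learned parameters w^i to a C-dimensional posterior estimate and taking arg max comes from the text.

```python
import numpy as np

# Toy forward pass through the operation chain x^1 -> [w^1] -> x^2 -> ... -> x^L.
def layer(x, w):
    # A generic hypothetical layer: affine map followed by ReLU.
    return np.maximum(w @ x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(1)
ws = [rng.standard_normal((4, 5)),    # "learned" parameters w^1
      rng.standard_normal((3, 4))]    # "learned" parameters w^2

x = rng.standard_normal(5)            # input x^1
for w in ws:                          # run forward, following the arrows
    x = layer(x, w)
xL = softmax(x)                       # x^L in R^C: posterior estimates (C = 3)
prediction = int(np.argmax(xL))       # output arg max_i x_i^L
print(xL, prediction)
```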

      SGD: As before in this chapter, the parameters of a CNN model are optimized to minimize the loss z; that is, we want the prediction of a CNN model to match the ground-truth labels. Let us suppose one training example x^1 is given for training such parameters. The training process involves running the CNN network in both directions. We first run the network in the forward pass to get x^L, achieving a prediction with the current CNN parameters. Instead of outputting this prediction, however, we need to compare it with the target t corresponding to x^1, that is, continue running the forward pass until the last loss layer. Finally, we obtain a loss z. The loss z is then a supervision signal, guiding how the parameters of the model should be modified (updated). The SGD method modifies the parameters as w^i ← w^i − η ∂z/∂w^i. Here, the ← sign implicitly indicates that the parameters w^i (of the i-th layer) are updated from time t to t + 1. If a time index t is used explicitly, this equation will look like (w^i)^{t+1} = (w^i)^t − η ∂z/∂(w^i)^t.
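The update rule can be illustrated on a deliberately tiny "network": a single linear layer with the squared ℓ2 loss z = ‖t − x^L‖²/2. The numbers and the analytic gradient below are ours, not the book's; the point is only the repeated forward pass followed by w ← w − η ∂z/∂w.

```python
import numpy as np

# A sketch of SGD, w <- w - eta * dz/dw, on a one-layer linear "network"
# with the squared l2 loss z = (t - xL)**2 / 2. All values are made up.
w = np.array([0.1, 0.2, -0.3])       # parameters of the single layer
x1 = np.array([1.0, -0.5, 2.0])      # one training example
t = 1.0                              # its ground-truth target
eta = 0.1                            # learning rate

for step in range(100):
    xL = w @ x1                      # forward pass: the current prediction
    dz_dw = (xL - t) * x1            # supervision signal dz/dw for this loss
    w = w - eta * dz_dw              # the SGD parameter update
print(abs(w @ x1 - t) < 1e-6)        # True: the loss has been driven near 0
```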

      Error backpropagation: As before, the last layer's partial derivatives are easy to compute. Because x^L is connected to z directly under the control of parameters w^L, it is easy to compute ∂z/∂w^L. This step is only needed when w^L is not empty. Similarly, it is also easy to compute ∂z/∂x^L. If the squared ℓ2 loss is used, ∂z/∂w^L is empty and ∂z/∂x^L = x^L − t.

      For every layer i, we compute two sets of gradients: the partial derivatives of z with respect to the parameters w^i and with respect to that layer's input x^i. The term ∂z/∂w^i can be used to update the current (i-th) layer's parameters, while ∂z/∂x^i can be used to update parameters backward, for example, in the (i − 1)-th layer. An intuitive explanation is that x^i is the output of the (i − 1)-th layer and ∂z/∂x^i is how x^i should be changed to reduce the loss function. Hence, we can view ∂z/∂x^i as the "error" supervision information propagated from z backward to the current layer, in a layer-by-layer fashion. Thus, we can continue the backpropagation process and use ∂z/∂x^i to propagate the errors backward to the (i − 1)-th layer.

      This layer-by-layer backward updating procedure makes learning a CNN much easier. When we are updating the i-th layer, the backpropagation process for the (i + 1)-th layer must have been completed. That is, we must already have computed the terms ∂z/∂w^{i+1} and ∂z/∂x^{i+1}. Both are stored in memory and ready for use. Our task now is to compute ∂z/∂w^i and ∂z/∂x^i. Using the chain rule, we have
