Скачать книгу

the context. On each trial k, the system receives empirical data, a pair of two real column vectors of sizes M and N : a pattern x(k)M; and a desired label d(k)N, with all pairs sharing the same desired relation d(k) = f(x(k)). Note that two distinct patterns can have the same label. The objective of the system is to estimate (learn) the function f(·) using the empirical data. As a simple example, suppose W is a tunable N × M matrix of parameters, and consider the estimator

      which is a single‐layer NN. The result of the estimator r = Wx should aim to predict the correct desired labels d = f(x) for new unseen patterns x. As before, to solve this problem, W is tuned to minimize some measure of error between the estimated and desired labels, over a K0 ‐long subset of the training set (for which k = 1, … , K0 ). If we define the error vector as y(k)Δ = d(k) − r(k), then a common measure is the mean square error: MSE equals sigma-summation Underscript k equals 1 Overscript upper K 0 Endscripts double-vertical-bar normal y Superscript left-parenthesis k right-parenthesis Baseline double-vertical-bar squared. Other error measures can be also be used. The performance of the resulting estimator is then tested over a different subset called the test set (k = K0 + 1, … , K).

      As before, a reasonable iterative algorithm for minimizing this objective is the online gradient descent (SGD) iteration normal upper W Superscript left-parenthesis k plus 1 right-parenthesis Baseline equals normal upper W Superscript left-parenthesis k right-parenthesis Baseline minus eta nabla Subscript normal upper W Sub Superscript left-parenthesis k right-parenthesis Baseline double-vertical-bar normal y Superscript left-parenthesis k right-parenthesis Baseline double-vertical-bar squared slash 2, where the 1/2 coefficient is written for mathematical convenience, η is the learning rate, a (usually small) positive constant; and at each iteration k, a single empirical sample x(k) is chosen randomly and presented at the input of the system. Using nabla Subscript normal upper W Sub Superscript left-parenthesis k right-parenthesis Baseline double-vertical-bar normal y Superscript left-parenthesis k right-parenthesis Baseline double-vertical-bar squared equals minus 2 left-parenthesis normal d Superscript left-parenthesis k right-parenthesis Baseline minus normal upper W Superscript left-parenthesis k right-parenthesis Baseline normal x Superscript left-parenthesis k right-parenthesis Baseline right-parenthesis left-parenthesis normal x Superscript left-parenthesis k right-parenthesis Baseline right-parenthesis Superscript normal upper T and defining ΔW(k) = W(k + 1) − W(k) and (·)T to be the transpose operation, we obtain the outer product

      (3.76)StartLayout 1st Row normal upper Delta normal upper W Superscript left-parenthesis k right-parenthesis Baseline equals eta normal y Superscript left-parenthesis k right-parenthesis Baseline left-parenthesis normal x Superscript left-parenthesis k right-parenthesis Baseline right-parenthesis Superscript normal upper T Baseline 2nd Row upper W Subscript italic n m Superscript left-parenthesis k plus 1 right-parenthesis Baseline equals upper W Subscript italic n m Superscript left-parenthesis k right-parenthesis Baseline plus eta x Subscript m Superscript left-parenthesis k right-parenthesis Baseline y Subscript n Superscript left-parenthesis k right-parenthesis Baseline period EndLayout

      The parameters of more complicated estimators can also be similarly tuned (trained), using backpropagation, as discussed earlier in this chapter.

      Notations: In the following, we will use xD to represent a column vector with D elements and a capital letter to denote a matrix XH × W with H rows and W columns. The vector x can also be viewed as a matrix with 1 column and D rows. These concepts can be generalized to higher‐order matrices, that is, tensors. For example, xH × W × D is an order‐3 (or third‐order) tensor. It contains HWD elements, each of which can be indexed by an index triplet (i, j, d), with 0 ≤ i < H, 0 ≤ j < W, and 0 ≤ d < D. Another way to view an order‐3 tensor is to treat it as containing D channels of matrices.

      For example, a color image is an order‐3 tensor. An image with H rows and W columns is a tensor of size H × W × 3; if a color image is stored in the RGB format, it has three channels (for R, G and B, respectively), and each channel is an H × W matrix (second‐order tensor) that contains the R (or G, or B) values of all pixels.

      It is beneficial to represent images (or other types of raw data) as a tensor. In early computer vision and pattern recognition, a color image (which is an order‐3 tensor) is often converted to the grayscale version (which is a matrix) because we know how to handle matrices much better than tensors. The color information is lost during this conversion. But color is very important in various image‐ or video‐based learning and recognition problems, and we do want to process color information in a principled way, for example, as in CoNNs.

      Given a tensor, we can arrange all the numbers inside it into a long vector, following a prespecified order. For example, in MATLAB, the (:) operator converts a matrix into a column vector in the column‐first order as

      (3.77)upper A equals Start 2 By 2 Matrix 1st Row 1st Column 1 2nd Column 2 2nd Row 1st Column 3 2nd Column 4 EndMatrix comma upper A left-parenthesis colon right-parenthesis equals left-parenthesis 1 comma 3 comma 2 comma 4 right-parenthesis Superscript upper T

      We use the notation “vec” to represent this vectorization operator. That is, vec(A) = (1, 3, 2, 4)T in the example. In order to vectorize an order‐3 tensor, we could vectorize its first channel (which is a matrix, and we already know how to vectorize it), then the second channel, … , and so on, until all channels are vectorized. The vectorization of the order‐3 tensor is then the concatenation of the vectorizations of all the channels in this order. The vectorization of an order‐3 tensor is a recursive process that utilizes the vectorization of order‐2 tensors. This recursive process can be applied to vectorize an order‐4 (or even higher‐order) tensor in the same manner.

      Vector calculus and the chain rule: The CoNN learning process depends on vector calculus and the chain rule. Suppose z is a scalar (i.e., z) and yH is a vector. If z is a function of y, then the partial derivative of z with respect to y is defined as [∂z/∂y]i = ∂z/∂yi . In other words, ∂z/∂y is a vector having the same size as y, and its i‐th element is ∂z/∂yi . Also, note that ∂z/∂yT = (∂z/∂y)T.

      Suppose now that xW is another vector, and y is a function of x. Then, the partial derivative of y with respect to x is defined as [∂y/∂xT]ij = ∂yi/∂xj . This partial derivative is a H

Скачать книгу