
102,764,544
Fully connected 2 | 4,096 neurons | 16,781,312
Fully connected 3 | 1,000 neurons | 4,097,000
Softmax | 1,000 classes |

Schematic illustration of the architecture of GoogleNet.

      First layer: The input image is 224 × 224 × 3. A convolutional layer with a 7 × 7 × 3 kernel and stride 2 produces an output feature map of 112 × 112 × 64. It is followed by ReLU and max pooling with a 3 × 3 kernel and stride 2, which reduces the feature map to 56 × 56 × 64. Local response normalization is then applied.

      Second layer: This is a simplified inception module. A 1 × 1 convolution with 64 filters first generates feature maps from the previous layer’s output, and a 3 × 3 convolution with 192 filters (stride 1) is then applied to them. ReLU and local response normalization follow. Finally, 3 × 3 max pooling with stride 2 yields 192 feature maps of size 28 × 28.

      Third layer: This is a complete inception module. The previous layer’s output is 28 × 28 with 192 channels, and four branches originate from it. The first branch uses a 1 × 1 convolution with 64 filters and ReLU, generating a 64 × 28 × 28 feature map. The second branch uses a 1 × 1 convolution with 96 filters (with ReLU) before a 3 × 3 convolution with 128 filters, generating a 128 × 28 × 28 feature map. The third branch uses a 1 × 1 convolution with 16 filters (with ReLU) before a 5 × 5 convolution with 32 filters, generating a 32 × 28 × 28 feature map. The fourth branch contains a 3 × 3 max pooling layer followed by a 1 × 1 convolution with 32 filters, generating a 32 × 28 × 28 feature map. The feature maps of the four branches are then concatenated, giving a 28 × 28 output with 64 + 128 + 32 + 32 = 256 channels.
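      The feature map sizes quoted for the three layers can be checked with a few lines of arithmetic. The following is a minimal sketch, not code from the book: the helper name out_size, the padding of 3 for the 7 × 7 convolution, and the rounding-up (ceil) behaviour of the pooling layers are assumptions chosen so that the computed sizes match the ones stated in the text. It also assumes the convolutions of the second layer preserve the 56 × 56 resolution.

```python
import math

def out_size(n, kernel, stride, padding=0, ceil_mode=False):
    """Spatial output size of a convolution or pooling layer (hypothetical helper)."""
    span = n + 2 * padding - kernel
    steps = math.ceil(span / stride) if ceil_mode else span // stride
    return steps + 1

# First layer: 7 x 7 convolution with stride 2 (padding 3 is an assumption).
conv1 = out_size(224, kernel=7, stride=2, padding=3)
# 3 x 3 max pooling with stride 2 (rounding up is an assumption).
pool1 = out_size(conv1, kernel=3, stride=2, ceil_mode=True)
# Second layer: assumed to keep 56 x 56 until its final 3 x 3 max pooling.
pool2 = out_size(pool1, kernel=3, stride=2, ceil_mode=True)

# Third layer: output channels of the four inception branches, as listed in the text.
branch_channels = {
    "1x1": 64,
    "1x1 -> 3x3": 128,
    "1x1 -> 5x5": 32,
    "pool -> 1x1": 32,
}
# Concatenation stacks the branches along the channel axis, so depths add up.
concat_channels = sum(branch_channels.values())

print(conv1, pool1, pool2)  # 112 56 28
print(concat_channels)      # 256
```

      Every branch keeps the 28 × 28 spatial size, so concatenation only changes the channel count: the output depth is the sum of the four branch widths.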

      1.2.6 ResNet

      Usually, the input feature map is fed through a series of convolutional layers, a non-linear activation function (ReLU), and a pooling layer to provide the output for the next layer. Training is done with the back-propagation algorithm. The accuracy of the network can be improved by increasing its depth, but only up to a point: as the network converges, its accuracy saturates, and adding further layers then degrades performance rapidly, which in turn results in higher training error. To address this degradation, along with the vanishing/exploding gradient problem, ResNet with a residual learning framework [6] was proposed, in which newly added layers fit a residual mapping rather than the desired mapping itself. If the identity mapping is already optimal, it is easier to push the residual to zero than to fit that mapping with a stack of non-linear layers. ResNet therefore rests on three ideas: residual learning, identity mapping, and skip connections. In residual learning, the input of a block bypasses the block's convolutional layers through a skip connection and is added to their output; the sum is then passed through the non-linear activation (ReLU) and pooling.
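      The residual idea can be illustrated with a toy example. This is a minimal sketch under stated assumptions, not the ResNet implementation: small fully connected layers stand in for the convolutional layers, pooling is omitted, and the names relu, linear, and residual_block are hypothetical. The block computes y = ReLU(F(x) + x), where F(x) is the residual mapping learned by the stacked layers.

```python
def relu(v):
    """Element-wise ReLU on a list of floats."""
    return [max(0.0, a) for a in v]

def linear(weights, v):
    """Matrix-vector product; stands in for a convolutional layer (assumption)."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]

def residual_block(x, w1, w2):
    """Compute ReLU(F(x) + x) with F(x) = W2 * ReLU(W1 * x)."""
    fx = linear(w2, relu(linear(w1, x)))         # residual mapping F(x)
    return relu([f + a for f, a in zip(fx, x)])  # skip connection adds x back

x = [1.0, 2.0, 3.0]
zeros = [[0.0] * 3 for _ in range(3)]

# With all weights at zero the residual F(x) vanishes and the block
# reduces to the identity mapping (for non-negative inputs).
y = residual_block(x, zeros, zeros)
print(y)  # [1.0, 2.0, 3.0]
```

      The example shows why extra layers need not hurt in principle: a block whose residual is pushed to zero simply passes its input through unchanged.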

Layer name | Input size | Filter size | Window size | # Filters | Stride | Depth | # 1 × 1 | # 3 × 3 reduce | # 3 × 3 | # 5 × 5 reduce | # 5 × 5 | Pool proj | Padding | Output size | Params | Ops
Convolution | 224 × 224 | 7 × 7 | - | 64 | 2 | 1 | | | | | | | 2 | 112 × 112 × 64 | 2.7M | 34M
Max pool | 112 × 112 | - | 3 × 3 | - | 2 | 0 | | | | | | | 0 | 56 × 56 × 64 | |
Convolution | 56 × 56 | 3 × 3 | - |
