
the same structure and weights. A simple example of the process can be written as

      (9) $\boldsymbol{h}^{(t)} = \sigma\left(\boldsymbol{W}_1 \boldsymbol{x}^{(t)} + \boldsymbol{W}_2 \boldsymbol{h}^{(t-1)} + \boldsymbol{b}\right)$

      where $\boldsymbol{W}_1$ and $\boldsymbol{W}_2$ are weight matrices of network $N$, $\sigma(\cdot)$ is an activation function, and $\boldsymbol{b}$ is the bias vector. Depending on the task, the loss function is evaluated, and the gradient is backpropagated through the network to update its weights. For a classification task, the final output $\boldsymbol{h}^{(T)}$ can be passed into another network to make a prediction. For a sequence-to-sequence model, $\hat{\boldsymbol{y}}^{(t)}$ can be generated based on $\boldsymbol{h}^{(t)}$ and then compared with $\boldsymbol{y}^{(t)}$.
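      As a concrete illustration of the recurrence in Eq. (9), the sketch below applies one recurrent step repeatedly over a short sequence with shared weights. It is a minimal NumPy sketch: the dimensions, tanh activation, and random weights are illustrative assumptions, not values from the text.

```python
import numpy as np

def rnn_step(x_t, h_prev, W1, W2, b):
    """One recurrent update, Eq. (9) with a tanh activation:
    h^(t) = tanh(W1 x^(t) + W2 h^(t-1) + b)."""
    return np.tanh(W1 @ x_t + W2 @ h_prev + b)

# Illustrative dimensions and initialisation (assumptions, not from the text).
input_dim, hidden_dim, T = 4, 8, 5
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(hidden_dim, input_dim))
W2 = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))
b = np.zeros(hidden_dim)

# Unroll the shared-weight recurrence over a sequence x^(1), ..., x^(T).
h = np.zeros(hidden_dim)
for t in range(T):
    x_t = rng.normal(size=input_dim)   # stand-in for the t-th input
    h = rnn_step(x_t, h, W1, W2, b)

print(h.shape)  # final state h^(T), e.g. fed to a classification head
```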

      However, a drawback of the RNN is that it has trouble “remembering” remote information. In an RNN, long-term memory is reflected in the weights of the network, which memorize remote information via shared weights. Short-term memory takes the form of information flow, where the output from the previous state is passed into the current state. However, when the sequence length $T$ is large, the optimization of the RNN suffers from the vanishing gradient problem. For example, if the loss $\ell^{(T)}$ is evaluated at $t = T$, the gradient w.r.t. $\boldsymbol{W}_1$ calculated via backpropagation can be written as

      (10) $\dfrac{\partial \ell^{(T)}}{\partial \boldsymbol{W}_1} = \sum_{t=0}^{T} \dfrac{\partial \ell^{(T)}}{\partial \boldsymbol{h}^{(T)}} \left( \prod_{j=t+1}^{T} \dfrac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}} \right) \dfrac{\partial \boldsymbol{h}^{(t)}}{\partial \boldsymbol{W}_1}$

      where $\prod_{j=t+1}^{T} \frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}}$ is the reason for the vanishing gradient. In the RNN, the tanh function is commonly used as the activation function, so

      (11) $\boldsymbol{h}^{(j)} = \tanh\left(\boldsymbol{W}_1 \boldsymbol{x}^{(j)} + \boldsymbol{W}_2 \boldsymbol{h}^{(j-1)} + \boldsymbol{b}\right)$
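      The excerpt stops the derivation here; to make the reasoning step explicit, differentiating Eq. (11) with respect to the previous hidden state gives the standard Jacobian below. This is a general property of the tanh recurrence, added for completeness rather than taken from the text.

```latex
\frac{\partial \boldsymbol{h}^{(j)}}{\partial \boldsymbol{h}^{(j-1)}}
  = \operatorname{diag}\!\left(1 - \left(\boldsymbol{h}^{(j)}\right)^{2}\right) \boldsymbol{W}_2
  % the square is taken elementwise, since tanh'(a) = 1 - tanh^2(a)
```

      Since each entry of $1 - (\boldsymbol{h}^{(j)})^{2}$ lies in $(0, 1]$, the repeated product of such Jacobians in Eq. (10) tends to shrink toward zero for large $T$ unless $\boldsymbol{W}_2$ compensates, which is the vanishing gradient behavior described above.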

      6.3 Long Short‐Term Memory Networks

      Since the original LSTM model was introduced, many variants have been proposed. The forget gate was introduced by Gers et al. [20]; it has proven effective and is standard in most LSTM architectures. The forward pass of an LSTM with a forget gate can be divided into two steps. In the first step, the following values are calculated:

      (12) $\begin{aligned}
      \boldsymbol{z}^{(t)} &= \tanh\left(\boldsymbol{W}_{1z} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2z} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_z\right) \\
      \boldsymbol{i}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1i} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2i} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_i\right) \\
      \boldsymbol{f}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1f} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2f} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_f\right) \\
      \boldsymbol{o}^{(t)} &= \sigma_g\left(\boldsymbol{W}_{1o} \boldsymbol{x}^{(t)} + \boldsymbol{W}_{2o} \boldsymbol{h}^{(t-1)} + \boldsymbol{b}_o\right)
      \end{aligned}$
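      A minimal NumPy sketch of these gate computations follows. The cell-state and hidden-state updates (the second step, which is not shown in this excerpt) are included as an assumption based on the common LSTM formulation; the dimensions and weight initialisation are illustrative, not from the text.

```python
import numpy as np

def sigmoid(a):
    # sigma_g in Eq. (12): the logistic gate activation
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x_t, h_prev, c_prev, W1, W2, b):
    """One LSTM step with a forget gate.

    W1, W2, b are dicts keyed by gate name ('z', 'i', 'f', 'o') holding the
    input-to-gate matrices, hidden-to-gate matrices, and biases of Eq. (12).
    """
    z = np.tanh(W1['z'] @ x_t + W2['z'] @ h_prev + b['z'])  # candidate update
    i = sigmoid(W1['i'] @ x_t + W2['i'] @ h_prev + b['i'])  # input gate
    f = sigmoid(W1['f'] @ x_t + W2['f'] @ h_prev + b['f'])  # forget gate
    o = sigmoid(W1['o'] @ x_t + W2['o'] @ h_prev + b['o'])  # output gate

    # Second step (standard formulation, assumed since the excerpt is cut off):
    c = f * c_prev + i * z      # new cell state
    h = o * np.tanh(c)          # new hidden state
    return h, c

# Illustrative dimensions and initialisation (assumptions, not from the text).
input_dim, hidden_dim = 4, 8
rng = np.random.default_rng(0)
gates = ['z', 'i', 'f', 'o']
W1 = {g: rng.normal(scale=0.1, size=(hidden_dim, input_dim)) for g in gates}
W2 = {g: rng.normal(scale=0.1, size=(hidden_dim, hidden_dim)) for g in gates}
b = {g: np.zeros(hidden_dim) for g in gates}

h, c = np.zeros(hidden_dim), np.zeros(hidden_dim)
for t in range(5):
    h, c = lstm_step(rng.normal(size=input_dim), h, c, W1, W2, b)
print(h.shape, c.shape)
```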