
as the vector , where H represents the measurement horizon.

       – The action: at each state, the agent must select the blocking factor p that the IoT objects will have to apply. This value is continuous, and the policy is deterministic in the problem that we consider; that is, the same state sk always yields the same action ak.

       – The revenue: this is the signal (the reward, in reinforcement learning terms) that the agent receives from the environment after executing an action. Thus, at stage k, the agent obtains a revenue rk as a consequence of the action ak that it carried out in state sk. This revenue tells the agent the quality of the action executed, the agent's objective being to maximize it.

      The revenue is therefore maximal when the chosen action results in a number of devices attempting access equal to the optimum. However, because the measured number of attempts is marred by noise, the measured revenue is affected as well.

      The objective of such a system is to find the blocking probability that maximizes the average revenue, which amounts to reducing the distance between the measured number of terminals attempting access and the optimum. To meet this objective, we rely on the TD3 algorithm.
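
      The link between revenue and distance to the optimum can be illustrated with a minimal sketch. The exact revenue formula is not reproduced here, so the negative absolute distance used below is only an assumption, and the names `revenue`, `measured_attempts` and `optimum_attempts` are hypothetical.

```python
def revenue(measured_attempts: float, optimum_attempts: float) -> float:
    """Hypothetical revenue: 0 at the optimum, more negative farther away.

    Assumption: the revenue decreases with the absolute distance between the
    measured number of attempts and the optimum. The source only states that
    the revenue is maximal when the two are equal.
    """
    return -abs(measured_attempts - optimum_attempts)
```

      With this choice, maximizing the average revenue is the same as minimizing the average distance to the optimum, and any noise on the measurement carries over directly to the revenue.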

      The TD3 algorithm is an actor-critic approach, in which the actor is a neural network that decides the action to take in a given state, while the critic network estimates the value of being in a state and choosing a particular action. TD3 addresses the overestimation of the value function (Thrun and Schwartz 1993) by introducing two critic networks and taking the minimum of their two estimates. This approach is particularly beneficial in our case due to the inherent presence of measurement errors.
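
      The role of the two critics can be sketched as follows. This is a simplified illustration of the clipped double-Q target, not the full TD3 training loop: the two value estimates are passed in as plain numbers rather than computed by networks, and the discount factor `gamma` is an assumed hyperparameter that is not given in the text.

```python
def td3_target(revenue, q1_next, q2_next, gamma=0.99, done=False):
    """Learning target built from the more pessimistic of two critic estimates.

    q1_next and q2_next stand for the values that the two (target) critics
    assign to the next state and next action. gamma is an assumed discount
    factor.
    """
    if done:
        # No future value once the episode has ended.
        return revenue
    pessimistic_q = min(q1_next, q2_next)  # guards against overestimation
    return revenue + gamma * pessimistic_q
```

      Because the minimum is taken, a single critic that overestimates a noisy value cannot inflate the target on its own.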

      2.5.2. Regulation system for arrivals

      The diagram in Figure 2.6 describes the system that controls the number of access attempts from IoT objects. This system relies on broadcasting the blocking factor to the terminals through the system information blocks (SIBs), and more particularly through SIB Type 14, which carries the access blocking parameters (ETSI 2019).

      After receiving the blocking factor, the terminals wishing to transmit execute the access class barring (ACB) procedure, which lets them proceed to the following stages with a probability p, calculated by our TD3-based controller. The terminals that pass can then attempt access by choosing a preamble at random from among the available preambles. Knowing the state of the preambles, the gNodeB can estimate the number of attempts made. This measurement is very noisy, since the model only makes it possible to estimate averages. We therefore smooth the estimated number of devices with a sliding average.
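
      The terminal-side check and the smoothing step can be sketched as below. This is only an illustration under stated assumptions: `acb_round` and `SlidingAverage` are hypothetical names, the window length of the sliding average is not given in the text, and the gNodeB's estimation model itself is not reproduced.

```python
import random
from collections import deque


def acb_round(num_waiting, p, num_preambles=16, rng=random):
    """Return the preamble index chosen by each terminal that passes the check.

    Each waiting terminal passes with probability p, then picks one of the
    available preambles uniformly at random.
    """
    choices = []
    for _ in range(num_waiting):
        if rng.random() < p:  # terminal passes with probability p
            choices.append(rng.randrange(num_preambles))
    return choices


class SlidingAverage:
    """Average of the most recent `window` values (window is an assumption)."""

    def __init__(self, window):
        self.values = deque(maxlen=window)

    def update(self, value):
        # The deque drops the oldest value automatically once it is full.
        self.values.append(value)
        return sum(self.values) / len(self.values)
```

      In such a setup, each new noisy estimate of the number of attempts would be passed to `update`, and the returned average would be the value handed to the controller.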

      The controller that we propose receives these measurements, together with the revenue, at the end of each preamble. The revenue tells it the quality of the actions taken. These data are stored in a memory of past experiences. A random subset of this memory is then used to learn robustly and, subsequently, to choose a new action.
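
      A memory of past experiences of this kind is commonly implemented as a replay buffer. The sketch below is a minimal version, assuming a fixed capacity and uniform sampling; the class name `ReplayMemory`, the capacity and the batch size are hypothetical and not taken from the text.

```python
import random
from collections import deque


class ReplayMemory:
    """Bounded store of (state, action, revenue, next_state) experiences."""

    def __init__(self, capacity):
        # Oldest experiences are discarded once the capacity is reached.
        self.buffer = deque(maxlen=capacity)

    def store(self, state, action, revenue, next_state):
        self.buffer.append((state, action, revenue, next_state))

    def sample(self, batch_size, rng=random):
        # Uniform random subset, without replacement, of the stored experiences.
        k = min(batch_size, len(self.buffer))
        return rng.sample(list(self.buffer), k)

    def __len__(self):
        return len(self.buffer)
```

      Sampling at random, rather than replaying the most recent experiences in order, breaks the correlation between consecutive measurements, which is one reason such a memory helps the controller learn robustly.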

      These different actions are repeated cyclically.

      Figure 2.6. System for regulating arrivals

      We have considered an NB-IoT antenna at which access requests arrive according to a Poisson process with a mean time between two arrivals of 0.018 s. We have considered a number of preambles N equal to 16, with a period of 0.1 s. In the system considered, each device can attempt access a maximum of 16 times. Beyond this limit, the terminal abandons transmission.
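
      To give an idea of the load that these figures imply, the arrivals can be simulated with a short sketch. It assumes that the 0.018 s value is the mean time between arrivals of a Poisson process and that the 0.1 s value is the interval between successive preamble occasions; the function name `arrivals_per_slot` is hypothetical.

```python
import random

MEAN_INTERARRIVAL_S = 0.018  # mean time between two access requests
SLOT_S = 0.1                 # assumed interval between preamble occasions


def arrivals_per_slot(num_slots, rng):
    """Count the new access requests that fall in each 0.1 s slot."""
    counts = [0] * num_slots
    horizon = num_slots * SLOT_S
    # Inter-arrival times of a Poisson process are exponentially distributed.
    t = rng.expovariate(1.0 / MEAN_INTERARRIVAL_S)
    while t < horizon:
        index = min(int(t / SLOT_S), num_slots - 1)  # guard against rounding
        counts[index] += 1
        t += rng.expovariate(1.0 / MEAN_INTERARRIVAL_S)
    return counts
```

      Under these assumptions, the expected number of new requests per slot is 0.1 / 0.018, that is, about 5.6, to which the retransmissions of previously blocked or collided terminals are added.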

      The performance of our controller, which is based on the TD3 technique, is compared with that of an adaptive approach. We have considered a measurement horizon H equal to 10. A larger measurement window does not significantly improve performance, which means that a window of 10 measurements suffices to reflect the real state of the network.

      The adaptive approach consists of gradually increasing the blocking probability when the number of attempts exceeds a predefined threshold above the optimal value. Conversely, when the number of attempts falls below a predefined threshold below the optimal value, the blocking probability is gradually reduced to allow more terminals to attempt access.
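
      The adaptive rule can be sketched as a simple threshold controller. The threshold width `margin` and the increment `step` are not specified in the text, so the values below are placeholders, and the function name `adapt_blocking` is hypothetical.

```python
def adapt_blocking(blocking, attempts, optimum, margin=1.0, step=0.05):
    """One update of the blocking probability by the adaptive rule.

    margin and step are assumed values: the source only says that the
    thresholds are predefined and that the change is gradual.
    """
    if attempts > optimum + margin:
        blocking += step  # too many attempts: block more terminals
    elif attempts < optimum - margin:
        blocking -= step  # too few attempts: let more terminals through
    # Keep the result a valid probability.
    return min(1.0, max(0.0, blocking))
```

      Between the two thresholds the blocking probability is left unchanged, so each new value depends directly on the previous one.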

      Figures 2.7 and 2.8 show the access probabilities for the two strategies considered. The adaptive technique (Figure 2.7) starts with an access probability of 1 and adapts itself to the traffic conditions, which vary according to a Poisson distribution. For the strategy based on the TD3 algorithm, there is an initial stage lasting 200 s, during which the algorithm explores the action space according to a uniform distribution (Figure 2.8). It is only after this stage that the algorithm begins to exploit what it has learned, which it refines as it gains experience.
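
      The two phases of the TD3-based strategy can be sketched as follows. This is an illustration only: `select_action` and `policy` are hypothetical names, the action is assumed to lie in [0, 1], and any exploration noise that may be added after the initial stage is left out.

```python
import random

EXPLORATION_S = 200.0  # duration of the initial exploration stage


def select_action(elapsed_s, state, policy, rng=random):
    """Pick an action in [0, 1] for the current state.

    policy is a hypothetical callable standing in for the trained actor.
    """
    if elapsed_s < EXPLORATION_S:
        # Initial stage: explore the action space uniformly at random.
        return rng.uniform(0.0, 1.0)
    # Afterwards: use what has been learned so far.
    return policy(state)
```

      During the first 200 s the chosen values therefore ignore the state entirely, and it is only afterwards that the same state leads to the same action.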

      We can note that under TD3 (Figure 2.8), successive actions are not linked to past actions, unlike in the adaptive case. Indeed, the values of the actions can change completely from one step to the next, because they depend only on the state of the network, which can change very quickly.

      Figure 2.7. Access probability with the adaptive controller

      Figure 2.8. Access probability with the controller using TD3

      Figure 2.9. Average latency of the terminals with the adaptive controller

      Figure 2.10. Average latency of the terminals with the controller using TD3
