certain characteristics more or less weight). In the first layers of a deep neural network, for example, if there is an attention mechanism, the mapping formed by those first layers can be used as a feature selection mechanism. On the one hand, an asymptotic bias can occur if the function approximator used for the value function and/or the policy and/or the model is too simple. On the other hand, when the function approximator generalizes poorly, there will be a large error due to the limited size of the data (overfitting).

      A model-based or a model-free method with a well-chosen function approximator may infer that the state's y-coordinate is less important than the x-coordinate, and generalize that to the policy. Depending on the task, a performant function approximator may be easier to find in either a model-free or a model-based approach. The choice of relying more on one or the other method is therefore also a key factor in improving generalization [13, 19].

      One approach to removing non-informative features is to force the agent to acquire a set of symbolic rules adapted to the task and to reason at a more abstract level. This abstract-level reasoning and the improved generalization have the potential to induce high-level cognitive functions such as analogical reasoning and transfer learning. For example, the feature space of the environment may integrate a relational learning system and thus extend the notion of contextual reinforcement learning.

      1.2.3.1 Auxiliary Tasks

      In the context of deep reinforcement learning, augmenting a deep reinforcement learning agent with auxiliary tasks within a jointly learned representation can substantially improve the sample efficiency of learning.

      This is accomplished by simultaneously maximizing several pseudo-reward functions, such as immediate reward prediction (γ = 0), predicting pixel changes in the next observation, or predicting the activation of some hidden unit of the agent's neural network.
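
      As a rough illustration, the auxiliary objectives can be folded into training by adding them to the main objective. The sketch below assumes a simple weighted sum of losses; the function name `combined_loss`, the task names, the loss values, and the weights are all hypothetical and not taken from the text.

```python
def combined_loss(main_loss, aux_losses, aux_weights):
    """Return the main RL loss plus a weighted sum of auxiliary losses.

    The weighted-sum form and all names here are illustrative assumptions.
    """
    if aux_losses.keys() != aux_weights.keys():
        raise ValueError("each auxiliary loss needs exactly one weight")
    return main_loss + sum(aux_weights[k] * aux_losses[k] for k in aux_losses)


# Hypothetical loss values for one training batch.
aux_losses = {
    "reward_prediction": 0.40,  # immediate reward prediction (gamma = 0)
    "pixel_change": 1.20,       # predicting pixel changes in the next observation
    "hidden_unit": 0.10,        # predicting the activation of a hidden unit
}
aux_weights = {"reward_prediction": 1.0, "pixel_change": 0.05, "hidden_unit": 0.5}

total = combined_loss(2.0, aux_losses, aux_weights)
print(f"total loss: {total:.2f}")
```

      Because all terms are computed from one shared representation, the gradients of the auxiliary terms also shape the features used by the main task.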

      1.2.3.2 Modifying the Objective Function

      In order to improve the policy learned by a deep RL algorithm, one can optimize an objective function that differs from the actual objective. Doing so typically introduces a bias, although this can help with generalization in some situations. The main approaches to modifying the objective function are:

       i) Reward shaping

      Reward shaping is a heuristic that modifies the reward of the task to make learning easier and faster. It incorporates prior knowledge by providing intermediate rewards for actions that lead to the desired outcome. This approach is often used in deep reinforcement learning to improve the learning process in environments with sparse and delayed rewards.
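
      One common way to add intermediate rewards is potential-based shaping, where the term γΦ(s') − Φ(s) is added to the environment reward. The text does not name this particular form, so the sketch below is an assumption, as are the 1-D corridor, the goal position `GOAL`, the discount `GAMMA`, and the potential function.

```python
GOAL = 5      # hypothetical goal position in a 1-D corridor
GAMMA = 0.9   # assumed discount factor


def potential(state):
    """Heuristic prior knowledge: states closer to the goal have higher potential."""
    return -abs(GOAL - state)


def shaped_reward(reward, state, next_state, gamma=GAMMA):
    """Add the shaping term gamma * phi(s') - phi(s) to the environment reward."""
    return reward + gamma * potential(next_state) - potential(state)


# The environment reward is sparse: 0 everywhere except at the goal.
toward = shaped_reward(0.0, state=2, next_state=3)  # step toward the goal
away = shaped_reward(0.0, state=2, next_state=1)    # step away from the goal
print(toward, away)
```

      With this potential, a step toward the goal receives a positive intermediate reward and a step away receives a negative one, even though the environment reward is zero in both cases.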

       ii) Tuning the discount factor

      When the model available to the agent is estimated from data, the policy found using a shorter planning horizon can actually be better than a policy learned with the true horizon. On the one hand, artificially reducing the planning horizon introduces a bias, since the objective function is modified. On the other hand, if a long planning horizon is targeted (the discount factor γ is close to 1), there is a higher risk of overfitting. This overfitting can be intuitively understood as linked to the accumulation of errors in the transitions and rewards estimated from data, as compared to the actual transition and reward probabilities [4].
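
      The effect of the discount factor on the planning horizon can be made concrete with a small calculation. The sketch below computes the discounted return of a hypothetical reward sequence and uses the common rule of thumb 1/(1 − γ) as the effective horizon; the reward sequence and the chosen values of γ are assumptions for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards for numerical simplicity."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def effective_horizon(gamma):
    """Rule-of-thumb number of steps that matter for a discount factor gamma < 1."""
    return 1.0 / (1.0 - gamma)


# Hypothetical episode: a single reward of 1 arrives at the 50th step.
rewards = [0.0] * 49 + [1.0]

short_sighted = discounted_return(rewards, gamma=0.5)
far_sighted = discounted_return(rewards, gamma=0.99)
print(effective_horizon(0.5), effective_horizon(0.99))
print(short_sighted, far_sighted)
```

      With γ = 0.5 the distant reward is almost invisible to the agent, while with γ = 0.99 it still carries most of its value, which is also why errors in an estimated model far from the current state matter more when γ is close to 1.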

      1.3.1 Value-Based Method

      Figure 1.4 Value-based learning.

      1 Take the state image, convert it to grayscale, and crop the unnecessary parts.

      2 Run the image through a series of convolutions and pooling in order to extract the important features that will help the agent make the decision.

      3 Calculate the Q-value of each possible action.

      4 Conduct backpropagation to find the most accurate Q-values.
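
      The steps above can be sketched in plain Python. This is a minimal illustration under strong assumptions: the convolution and pooling stage of step 2 is replaced by a linear stand-in (one weight vector per action), the frame is a hypothetical 4x4 all-white image, and every function name is made up for this example. The grayscale weights are the standard luma coefficients.

```python
def to_grayscale(image):
    """Step 1a: convert rows of (r, g, b) tuples to grayscale intensities."""
    return [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row] for row in image]


def crop(image, top, bottom, left, right):
    """Step 1b: keep only rows top..bottom-1 and columns left..right-1."""
    return [row[left:right] for row in image[top:bottom]]


def flatten(image, scale=255.0):
    """Turn the cropped image into a feature vector with values in [0, 1]."""
    return [pixel / scale for row in image for pixel in row]


def q_values(features, weights):
    """Step 3: one Q-value per action (linear stand-in for the network of step 2)."""
    return [sum(w * x for w, x in zip(w_a, features)) for w_a in weights]


def td_update(weights, features, action, reward, next_features,
              gamma=0.9, lr=0.1, done=False):
    """Step 4: one semi-gradient step on the squared TD error of the chosen action."""
    bootstrap = 0.0 if done else gamma * max(q_values(next_features, weights))
    error = reward + bootstrap - q_values(features, weights)[action]
    weights[action] = [w + lr * error * x for w, x in zip(weights[action], features)]
    return error


# Hypothetical 4x4 all-white RGB frame standing in for a game screen.
frame = [[(255, 255, 255)] * 4 for _ in range(4)]
features = flatten(crop(to_grayscale(frame), 1, 3, 1, 3))  # 2x2 centre patch
weights = [[0.0] * len(features) for _ in range(2)]        # two actions

error = td_update(weights, features, action=0, reward=1.0,
                  next_features=features, done=True)
q_after = q_values(features, weights)
print(error, q_after)
```

      In a deep Q-network the same update is carried out by backpropagation through the convolutional layers instead of through a single linear layer.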

      1.3.2 Policy-Based Method

      In the real world, the number of possible actions may be very high or unknown. For instance, a robot learning to move on open fields may have millions of possible actions within the space of a minute. In these conditions, estimating Q-values for each action is not practicable. Policy-based approaches learn the policy function directly, without computing a value function for each action. An example of a policy-based algorithm is Policy Gradient (Figure 1.5).

      Policy Gradient, simplified, works as follows:

      1 Takes in a state and gets the probability of each action based on prior experience

      2 Chooses the most probable action

      3 Repeats until the end of the game and evaluates the total reward

      4 Uses backpropagation to update the connection weights based on the reward
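
      The loop above can be sketched with a REINFORCE-style update on a tiny two-armed bandit. This is a minimal sketch under stated assumptions: the policy is a softmax over two preference values rather than a neural network, the reward probabilities, learning rate, and seed are hypothetical, and, unlike step 2 as written, the sketch samples actions from the policy instead of always taking the most probable one, which is the usual way to keep exploring.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into action probabilities."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def run_episode(prefs, reward_probs, rng, steps=10):
    """Steps 1-3: sample actions from the policy and sum the rewards."""
    probs = softmax(prefs)
    actions, total_reward = [], 0.0
    for _ in range(steps):
        action = rng.choices(range(len(prefs)), weights=probs)[0]
        total_reward += 1.0 if rng.random() < reward_probs[action] else 0.0
        actions.append(action)
    return actions, total_reward


def reinforce_update(prefs, actions, total_reward, lr=0.01):
    """Step 4: move preferences along grad log pi(a), scaled by the episode return."""
    probs = softmax(prefs)
    grads = [0.0] * len(prefs)
    for action in actions:
        for a in range(len(prefs)):
            grads[a] += (1.0 if a == action else 0.0) - probs[a]
    return [p + lr * total_reward * g for p, g in zip(prefs, grads)]


rng = random.Random(0)
reward_probs = [0.2, 0.8]  # hypothetical: the second arm pays off more often
prefs = [0.0, 0.0]
for _ in range(200):
    actions, total_reward = run_episode(prefs, reward_probs, rng)
    prefs = reinforce_update(prefs, actions, total_reward)
print(softmax(prefs))
```

      No baseline is subtracted from the return here, so individual runs can be noisy; in practice a baseline or a learned critic is commonly added to reduce the variance.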

      Figure 1.5 Policy-based learning.

      1.4.1 Applications

      Deep RL techniques have demonstrated their ability to tackle a wide range of problems that were previously unsolved. Some of the most renowned achievements are beating previous computer programmes in the game of backgammon, attaining superhuman-level performance in Atari games from the pixels, mastering the game of Go, and beating professional poker players in heads-up no-limit Texas Hold'em (Libratus and DeepStack).

      RL is also relevant to fields where one might assume that supervised learning alone is adequate, such as sequence prediction. Designing the right neural architecture for supervised learning tasks has also been cast as an RL problem. Note that these types of tasks can also be tackled with evolutionary strategies. Finally, it should be mentioned that deep RL has applications in classical and fundamental algorithmic problems in computer science, such as the travelling salesman
