Скачать книгу

to ensure that the coordination graph is fully cooperative, all agents share the payoff function u(a). We abuse the notation e both to index a local payoff function ue and to denote the subset of agents in its scope; ae is thus a local joint action, i.e., a joint action of this subset of agents. The decomposition of u(a) into local payoff functions can be represented as a factor graph [Bishop, 2006] (Figure 2.1); a bipartite graph containing two types of vertices: agents (variables) and local payoff functions (factors), with edges connecting local payoff functions to the agents in their scope.

      The main challenge in a CoG is that the size of the joint action space A grows exponentially with the number of agents. It thus quickly becomes intractable to enumerate all joint actions and their associated payoffs. Key to solving CoGs is therefore to exploit loose couplings between agents, i.e., each agent’s behavior directly affects only a subset of the other agents.

      Figure 2.1: A CoG with three agents and two local payoff functions. The factor graph illustrates the loose couplings that result from the decomposition into local payoff functions. In particular, each agent’s choice of action directly depends only on those of its immediate neighbors, e.g., once agent 1 knows agent 2’s action, it can choose its own action without considering agent 3.

      Figure 2.1 shows the factor graph of an example CoG in which the team payoff function decomposes into two local payoff functions, each with two agents in scope:

Image

      The local payoff functions are defined in Table 2.1. We use this CoG as a running example throughout this book. The local payoff functions, with their limited scopes, encode the loose couplings: each agent can only directly affect another agent when they share, i.e., are both in the scope of, a local payoff function. For example, if we fix the action for agent 2 to be 2, then agents 1 and 3 can decide upon their optimal actions independently, as they do not directly affect each other.

Image

      We now consider the multi-objective setting:

      • D and A are the same as in a CoG, but,

      • U = {u1,…,uρ} is now the set of ρ, d-dimensional local payoff functions. The total team payoff is the sum of local vector-valued payoffs: Image.

      For example, payoffs for a MO-CoG structured as in Figure 2.1, i.e.,

Image

      are provided in Table 2.2.

Image

      The only difference between a MO-CoG and a CoG is the dimensionality of the payoff functions. A CoG is thus a special case of a MO-CoG, i.e., a MO-CoG in which d = 1. Furthermore, when preferences are known, it may be possible to scalarize a MO-CoG and thus transform it into a CoG. For example, if we know the scalarization function is linear, i.e., f(u, w) = w · u, and its parameter vector w = (0.75,0.25) is, then we can scalarize the multi-objective payoff functions of Table 2.2 to the single-objective payoff functions of Table 2.1 before planning.

      In a deterministic policy for a MO-CoG, the agents pick one joint action. In a stochastic policy, the agents draw a joint action from a probability distribution. Note that such a probability distribution is in general defined over joint actions, and there can thus still be coordination between the agents when the policy is stochastic.

      When f and/or w are unknown in the planning or learning phase—as is the case in the unknown weights and decision support scenarios discussed in Section 1.1—we need to produce a set of policies that contains at least one optimal policy for each possible f and w. The solution of a MO-CoG is thus a coverage set of policies. In general, this can contain both deterministic and stochastic policies. We explain why this is important for MO-CoGs (but not for CoGs) in Chapter 3.

      The second class of MODPs that we treat is the multi-objective Markov decision process (MOMDP), in which a single agent needs to perform a sequence of actions over time. This sequence of actions typically takes place in an environment that is affected by these actions. Therefore, the agent has to consider not only its immediate reward, but also the reward it will be able obtain later in the states it visits in the future.

      Consider the robot (shown as a Pacman symbol) in Figure 2.2 that needs to navigate in a socially appropriate way to reach the blue target zone in the upper right corner. We want the robot to reach the target as soon as possible, i.e., minimize the time to reach the target, but also minimize the hindrance that the robot causes to the other person by avoiding her personal space (indicated in red) along the way. By driving through the personal space of the person, it can obtain a higher value with respect to the first objective but a lower value in the second objective. Which policy is optimal thus depends on how much we value the first objective relative to the second objective.

      Figure 2.2: A social robot navigation problem MDP: a robot (indicated by the pacman symbol) needs to reach the target zone in the upper right corner (objective 1). However, the robots needs to avoid the personal space of the person standing in between the start position of the robot and the target zone (objective 2).

      The single-objective version of an MOMDP is an MDP:

      • S is the state space, i.e., the set of possible states

Скачать книгу