Gittins index thus provides an amazingly straightforward solution to the multi-armed bandit problem. But it doesn’t necessarily close the book on the puzzle, or help us navigate all the explore/exploit tradeoffs of everyday life. For one, the Gittins index is optimal only under some strong assumptions. It’s based on geometric discounting of future reward, valuing each pull at a constant fraction of the previous one, which is something that a variety of experiments in behavioral economics and psychology suggest people don’t do. And if there’s a cost to switching among options, the Gittins strategy is no longer optimal either. (The grass on the other side of the fence may look a bit greener, but that doesn’t necessarily warrant climbing the fence—let alone taking out a second mortgage.) Perhaps even more importantly, it’s hard to compute the Gittins index on the fly. If you carry around a table of index values you can optimize your dining choices, but the time and effort involved might not be worth it. (“Wait, I can resolve this argument. That restaurant was good 29 times out of 35, but this other one has been good 13 times out of 16, so the Gittins indices are … Hey, where did everybody go?”)

      In the time since the development of the Gittins index, such concerns have sent computer scientists and statisticians searching for simpler and more flexible strategies for dealing with multi-armed bandits. These strategies are easier for humans (and machines) to apply in a range of situations than crunching the optimal Gittins index, while still providing comparably good performance. They also engage with one of our biggest human fears regarding decisions about which chances to take.

      Regret and Optimism

      Regrets, I’ve had a few. But then again, too few to mention.

      —FRANK SINATRA

      For myself I am an optimist. It does not seem to be much use being anything else.

      —WINSTON CHURCHILL

      If the Gittins index is too complicated, or if you’re not in a situation well characterized by geometric discounting, then you have another option: focus on regret. When we choose what to eat, who to spend time with, or what city to live in, regret looms large—presented with a set of good options, it is easy to torture ourselves with the consequences of making the wrong choice. These regrets are often about the things we failed to do, the options we never tried. In the memorable words of management theorist Chester Barnard, “To try and fail is at least to learn; to fail to try is to suffer the inestimable loss of what might have been.”

      Regret can also be highly motivating. Before he decided to start Amazon.com, Jeff Bezos had a secure and well-paid position at the investment company D. E. Shaw & Co. in New York. Starting an online bookstore in Seattle was going to be a big leap—something that his boss (that’s D. E. Shaw) advised him to think about carefully. Says Bezos:

      The framework I found, which made the decision incredibly easy, was what I called—which only a nerd would call—a “regret minimization framework.” So I wanted to project myself forward to age 80 and say, “Okay, now I’m looking back on my life. I want to have minimized the number of regrets I have.” I knew that when I was 80 I was not going to regret having tried this. I was not going to regret trying to participate in this thing called the Internet that I thought was going to be a really big deal. I knew that if I failed I wouldn’t regret that, but I knew the one thing I might regret is not ever having tried. I knew that that would haunt me every day, and so, when I thought about it that way it was an incredibly easy decision.

      Computer science can’t offer you a life with no regret. But it can, potentially, offer you just what Bezos was looking for: a life with minimal regret.

      Regret is the result of comparing what we actually did with what would have been best in hindsight. In a multi-armed bandit, Barnard’s “inestimable loss” can in fact be measured precisely, and regret assigned a number: it’s the difference between the total payoff obtained by following a particular strategy and the total payoff that theoretically could have been obtained by just pulling the best arm every single time (had we only known from the start which one it was). We can calculate this number for different strategies, and search for those that minimize it.
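As a rough sketch (not from the book), the definition translates directly into a few lines of Python; the names `rewards_received` and `best_arm_mean` are illustrative placeholders rather than anything the authors specify.

```python
def total_regret(rewards_received, best_arm_mean):
    """Regret after a sequence of pulls: the payoff we could have expected
    from always playing the best arm, minus what we actually collected."""
    pulls = len(rewards_received)
    return best_arm_mean * pulls - sum(rewards_received)

# e.g., 100 pulls averaging $1 each, when the best machine pays $1.50 on average
print(total_regret([1.0] * 100, best_arm_mean=1.5))  # 50.0
```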

      In 1985, Herbert Robbins took a second shot at the multi-armed bandit problem, some thirty years after his initial work on Win-Stay, Lose-Shift. He and fellow Columbia mathematician Tze Leung Lai were able to prove several key points about regret. First, assuming you’re not omniscient, your total amount of regret will probably never stop increasing, even if you pick the best possible strategy—because even the best strategy isn’t perfect every time. Second, regret will increase at a slower rate if you pick the best strategy than if you pick others; what’s more, with a good strategy regret’s rate of growth will go down over time, as you learn more about the problem and are able to make better choices. Third, and most specifically, the minimum possible regret—again assuming non-omniscience—is regret that increases at a logarithmic rate with every pull of the handle.

      Logarithmically increasing regret means that we’ll make as many mistakes in our first ten pulls as in the following ninety, and as many in our first year as in the rest of the decade combined. (The first decade’s mistakes, in turn, are as many as we’ll make for the rest of the century.) That’s some measure of consolation. In general we can’t realistically expect someday to never have any more regrets. But if we’re following a regret-minimizing algorithm, every year we can expect to have fewer new regrets than we did the year before.
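To see why the arithmetic works out that way, suppose cumulative regret grows in proportion to the logarithm of the number of pulls (an assumption for the sake of illustration; the book gives no formula). Then every tenfold stretch of experience adds the same amount of new regret:

```python
import math

# If cumulative regret ~ c * ln(t), then the regret added between pull 10 and
# pull 100 equals the regret accumulated over the first 10 pulls, and so on.
for t in (10, 100, 1000):
    print(t, round(math.log(t), 2))
# 10 2.3
# 100 4.61
# 1000 6.91  -- equal increments for each factor-of-ten increase
```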

      Starting with Lai and Robbins, researchers in recent decades have set about looking for algorithms that offer the guarantee of minimal regret. Of the ones they’ve discovered, the most popular are known as Upper Confidence Bound algorithms.

      Visual displays of statistics often include so-called error bars that extend above and below any data point, indicating uncertainty in the measurement; the error bars show the range of plausible values that the quantity being measured could actually have. This range is known as the “confidence interval,” and as we gain more data about something the confidence interval will shrink, reflecting an increasingly accurate assessment. (For instance, a slot machine that has paid out once in two pulls will have the same expected value as, but a wider confidence interval than, a machine that has paid out five times in ten pulls.) In a multi-armed bandit problem, an Upper Confidence Bound algorithm says, quite simply, to pick the option for which the top of the confidence interval is highest.
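The slot-machine comparison can be made concrete with a standard normal-approximation confidence interval (one common choice; the book doesn't specify which interval it has in mind). Both machines have an estimated payout rate of 50%, but the interval around the smaller sample is much wider:

```python
import math

def confidence_interval(successes, pulls, z=1.96):
    """Normal-approximation 95% confidence interval for a payout rate."""
    p = successes / pulls
    half_width = z * math.sqrt(p * (1 - p) / pulls)
    return (p - half_width, p + half_width)

print(confidence_interval(1, 2))    # roughly (-0.19, 1.19): very uncertain
print(confidence_interval(5, 10))   # roughly (0.19, 0.81): same mean, much narrower
```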

      Like the Gittins index, therefore, Upper Confidence Bound algorithms assign a single number to each arm of the multi-armed bandit. And that number is set to the highest value that the arm could reasonably have, based on the information available so far. So an Upper Confidence Bound algorithm doesn’t care which arm has performed best so far; instead, it chooses the arm that could reasonably perform best in the future. If you have never been to a restaurant before, for example, then for all you know it could be great. Even if you have gone there once or twice, and tried a couple of their dishes, you might not have enough information to rule out the possibility that it could yet prove better than your regular favorite. Like the Gittins index, the Upper Confidence Bound is always greater than the expected value, but by less and less as we gain more experience with a particular option. (A restaurant with a single mediocre review still retains a potential for greatness that’s absent in a restaurant with hundreds of such reviews.) The recommendations given by Upper Confidence Bound algorithms will be similar to those provided by the Gittins index, but they are significantly easier to compute, and they don’t require the assumption of geometric discounting.
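A minimal sketch of one member of this family, the widely used UCB1 rule (not named in the text, so take it as an illustration rather than the authors' own method): each arm's score is its average payoff so far plus a bonus that shrinks as the arm is pulled more often, and we always play the arm with the highest score.

```python
import math

def ucb1_choice(successes, pulls):
    """Pick the arm whose upper confidence bound is highest.
    successes[i] and pulls[i] record arm i's history; untried arms win automatically."""
    total_pulls = sum(pulls)
    best_arm, best_score = None, float("-inf")
    for i in range(len(pulls)):
        if pulls[i] == 0:
            return i  # an untried arm could be anything: try it
        bonus = math.sqrt(2 * math.log(total_pulls) / pulls[i])
        score = successes[i] / pulls[i] + bonus
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm

# The restaurant with one good meal out of two still outranks the
# much-sampled regular favorite, because so little is known about it:
print(ucb1_choice(successes=[1, 60], pulls=[2, 100]))  # 0
```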

      Upper Confidence Bound algorithms implement a principle that has been dubbed “optimism in the face of uncertainty.” Optimism, they show, can be perfectly rational. By focusing on the best that an option could be, given the evidence obtained so far, these algorithms give a boost to possibilities we know less about. As a consequence, they naturally inject a dose of exploration into the decision-making process, leaping at new options with enthusiasm because any one of them could be the next big thing. The same principle has been used, for instance, by MIT’s Leslie Kaelbling, who builds “optimistic robots” that explore the space around them by boosting the value of uncharted terrain. And it clearly has implications for human lives as well.

      The success of Upper Confidence Bound algorithms offers a formal justification for the benefit of the doubt.