The Exploration/Exploitation Framework

We all have a variety of mental models that we use to interpret the world around us. Some of these are very specific: I know that turning the key in the ignition starts my car, and that knowledge doesn't help me understand much else about the world. Others generalize: the concept of different keys fitting different locks is a metaphor we apply to many other areas of life.

One of my favorite models comes from reinforcement learning, and it is particularly applicable to how our brains function. In the most general case, assume that you have many different options, and each of them gives you a payoff drawn at random from an unknown distribution. Your goal is to maximize the total payoff you receive over a fixed time horizon. This type of game is epitomized by the multi-armed bandit problem, or by the Wisconsin Card Sorting Test when those distributions change over time.
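To make the setup concrete, here is a minimal sketch of such a game in Python. The arm count and the Gaussian payoffs are my own illustrative assumptions, not part of the problem statement:

```python
import random

# A toy version of the game described above: each "arm" pays out from a
# fixed Gaussian whose mean is hidden from the player. (Arm count and
# payoff distributions here are illustrative assumptions.)
class Bandit:
    def __init__(self, n_arms=5, seed=0):
        self.rng = random.Random(seed)
        # Hidden per-arm expected payoffs; the player never sees these.
        self.means = [self.rng.uniform(0, 1) for _ in range(n_arms)]

    def pull(self, arm):
        # One noisy sample: mean plus Gaussian noise, so a single pull
        # tells you very little about the underlying distribution.
        return self.rng.gauss(self.means[arm], 1.0)
```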

So how do you optimize these types of tasks?

In the beginning, you have no idea what the distribution is on any of the options, so naturally the first step is to sample each of them to obtain data. This process is known as exploration. Then, once we know which distribution has the highest expected value, we can simply pick that option until the end of the game to maximize our payoffs. This process is known as exploitation.
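Sketched in code (reusing the toy Bandit above, with an arbitrary samples_per_arm), that naive explore-then-exploit strategy looks something like this:

```python
def explore_then_commit(bandit, n_arms, horizon, samples_per_arm=10):
    """Naive two-phase strategy: sample every arm a fixed number of
    times, then commit to the best-looking arm for the remaining turns."""
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    payoff = 0.0
    pulls = 0

    # Exploration phase: gather data on every option.
    for _ in range(samples_per_arm):
        for arm in range(n_arms):
            r = bandit.pull(arm)
            totals[arm] += r
            counts[arm] += 1
            payoff += r
            pulls += 1

    # Exploitation phase: play the empirical best arm until time runs out.
    best = max(range(n_arms), key=lambda a: totals[a] / counts[a])
    for _ in range(horizon - pulls):
        payoff += bandit.pull(best)
    return payoff
```

For example, `explore_then_commit(Bandit(), n_arms=5, horizon=1000)` plays one full game.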

That explanation makes it sound trivially easy, but once you dig down into that process a little further you will see that it is covering up an enormous world of decision making…

In the multi-armed bandit problem, we begin with complete uncertainty about the distribution of each option. Since we have to learn something about the distributions before we can do anything else, the only choice is to begin with an initial exploration phase. So we start by sampling each option once, and let's say each one gives a different response. In theory, we could begin the exploitation phase immediately, simply picking the option with the highest observed value for the rest of the game. I suspect you immediately came up with an objection to this idea: "Wait a minute, we only have a single data point here. How could we possibly have characterized that distribution? It takes at least two points to define a line, much less a higher-dimensional space!" And yet, to do otherwise means going against the option with the highest expected value.

Thus you perceive the first wrinkle: there is a fundamental tradeoff between exploration and exploitation. Every turn we spend sampling a distribution other than the one we perceive to be the best, we forgo the opportunity to score more points.
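A textbook way to manage this tradeoff (my addition here, not part of the original framing) is the epsilon-greedy policy: exploit the best-looking option most of the time, but keep sampling the others with some small probability, so you never stop gathering evidence entirely. A sketch, again against the toy Bandit above:

```python
import random

def epsilon_greedy(bandit, n_arms, horizon, epsilon=0.1, seed=1):
    """Interleave exploration and exploitation: with probability epsilon
    pull a random arm, otherwise pull the arm with the best average so far."""
    rng = random.Random(seed)
    totals = [0.0] * n_arms
    counts = [0] * n_arms
    payoff = 0.0
    for _ in range(horizon):
        untried = [a for a in range(n_arms) if counts[a] == 0]
        if untried:
            arm = untried[0]                      # try every arm once first
        elif rng.random() < epsilon:
            arm = rng.randrange(n_arms)           # explore
        else:
            arm = max(range(n_arms),
                      key=lambda a: totals[a] / counts[a])  # exploit
        r = bandit.pull(arm)
        totals[arm] += r
        counts[arm] += 1
        payoff += r
    return payoff
```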

Notice that we don't just have uncertainty about the payoff structure of each option; we also have uncertainty about how long we should spend exploring versus pursuing the best option we know about. This gets at the heart of why I find this framework so useful: it is particularly adept at modeling an uncertain world.

Like most models of this type, the multi-armed bandit abstracts away a lot of real-world complexity. We are born into this world with virtually no information about how it works, save for an accumulation of genetic, epigenetic, and prenatal environmental signals (and don't get me wrong, this is potentially useful data). Not only do we not know what the payoffs of the various options are, we don't even know what the options are or how many of them exist! Furthermore, the distributions themselves can change (and we don't know how fast each one changes), meaning that our previous conclusions grow weaker over time. That alone rules out any clear-cut rule for how long to explore before exploiting, instead requiring continual updating on the fly. The multi-armed bandit assumes a single unified payoff, but in practice people have drives for different things – money, power, sex, love, freedom, etc. – and each option may deliver different amounts of each… and over the course of our lives we may find that our very preferences are changing! We don't even know the exact ending period, though we have a pretty good idea – barring, of course, radical life extension technology, which we would then need to take into account. We are uncertain about our level of uncertainty at every step!
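When the distributions drift like this, one common fix (again a textbook technique, not something from the bandit setup above) is to update your estimate of each option with a constant step size, so older observations decay and the estimate tracks the moving target:

```python
def update_estimate(estimate, reward, step_size=0.1):
    """Exponential recency-weighted average: a constant step size makes
    old evidence fade geometrically, unlike a plain running mean, so the
    estimate can follow a payoff distribution that drifts over time."""
    return estimate + step_size * (reward - estimate)
```

The larger the step size, the faster old conclusions are discounted – which is exactly the "how fast does this distribution change?" question in miniature.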

Getting a handle on this kind of uncertainty is precisely why I like models so much. Insofar as reality does resemble these sorts of games, the conclusions we draw about them (and the solutions we devise for them) can be useful for thinking about how we approach life. By thinking about exploration/exploitation tradeoffs, we realize that we need to be collecting evidence about the following, among many other things:

  • How many different options exist, and what are they? Will new ones develop later on?
  • What payoffs do we receive from each of these options? How do those change over time? What do we expect them to do in the future?
  • How much of our time and attention should we invest in exploring these options? Can we rule out some of them more quickly than others? How often should we go back and take another look at something we previously rejected?
  • How long should we spend trying to figure out what our complex preferences actually are? How much do our preferences change over time? How much do we expect them to change in the future?
  • Does the neuroplasticity of our brain decrease over time, shifting the balance towards increased exploitation? Is this inevitable or can we change this using behavioral or pharmacological techniques?
  • Are we going to have a normal human lifespan? Will that lifespan radically increase over time? Will technology allow the information stored in our brain to be recovered after our death?
  • How does our general decision making change in response to increased uncertainty?
  • How long should we spend thinking about exploration/exploitation tradeoffs?

In practice, your brain is a complex adaptive neural network that performs something like approximate Bayesian updating on this evidence as it comes in, mediated by an emotional response to each of the available options. Over time you will develop good or bad feelings about your options, based on your historical experience. Doing good things is more rewarding than doing bad ones, so we pick those options and stick with them. For the most part the system works relatively well – when we're operating inside our evolutionary parameters, at least – and it has the bonus feature of running without our conscious input. Our brain is capable of processing a truly massive amount of information, and we would be foolish to ignore this potential.
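For a flavor of what "approximately Bayesian updating mediated by feelings about each option" might look like in algorithmic form, here is a Thompson-sampling sketch for yes/no payoffs – a loose analogy on my part, not a claim about how the brain actually works:

```python
import random

def thompson_step(successes, failures, pull, rng=random):
    """One round of Thompson sampling over Bernoulli options: draw a
    plausible payoff rate for each option from its Beta posterior (the
    'feeling' about it), act on the most optimistic draw, then update
    the belief for that option with the observed outcome."""
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    arm = draws.index(max(draws))
    reward = pull(arm)  # assumed to return 1 (good outcome) or 0 (bad)
    if reward:
        successes[arm] += 1
    else:
        failures[arm] += 1
    return arm, reward
```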

And yet, we don't live in the ancestral environment anymore. In a slowly changing world, it was enough to observe what your forebears did and copy them. You could learn about all the useful plants in your immediate vicinity, where to find them, how to build tools, how to hunt animals… This is the world for which our brains are designed. Instead, we live in a world of constant flux and uncertainty, surrounded by more people than we could ever associate with. Our default responses pull us in a thousand directions, drifting aimlessly from stimulus to stimulus. Humans are not automatically strategic, in a world where we desperately need to be. In the modern environment we can do much better than our defaults, by explicitly reflecting on our decision-making processes.

In the next blog post, I will talk about how I have applied this framework to my own thinking. Until then, what do you think about this model? How would you extend it? Do you use it in your own life?
