General notes on second edition (updated June 23, 2011)
Chapter 1 - Introduction (updated June 23, 2011)
Chapter 2 - Some Illustrative Models (updated November 12, 2012)
Chapter 3 - Introduction to Markov Decision Processes (updated May 30, 2012)
Chapter 4 - Introduction to Approximate Dynamic Programming (updated June 23, 2011)
Chapter 5 - Modeling Dynamic Programs (updated June 23, 2011)
Chapter 6 - Policies (updated July 31, 2011)
Chapter 7 - Policy Search (updated March 13, 2015)
Chapter 8 - Approximating Value Functions (updated June 23, 2011)
Chapter 9 - Learning Value Function Approximations (updated June 23, 2011)
Chapter 10 - Optimizing While Learning (updated June 23, 2011)
Chapter 11 - Adaptive Estimation and Stepsizes (updated June 23, 2011)
Chapter 12 - Exploration vs. Exploitation (updated June 23, 2011)
Chapter 13 - Value Function Approximations for Resource Allocation Problems (updated June 23, 2011)
Chapter 14 - Dynamic Resource Allocation Problems (updated June 23, 2011)
Chapter 15 - Implementation Challenges (updated June 23, 2011)
The second edition of Approximate Dynamic Programming is a major revision, partly reflecting the growth of the field, but primarily reflecting what I have learned since 2005, when work on the first edition drew to a close. Major changes in the second edition include:
The usual introductory fluff. This chapter sets the tone of the book and identifies its intended audience. I try to place this book in the context of the rest of the literature.
There is tremendous richness in modeling stochastic, dynamic problems, as we show in chapter 5. In this chapter, we make no effort to explain the principles of modeling; instead, we try to teach modeling by example, illustrating both deterministic and stochastic problems. At the end of the chapter is a half-page summary of the core elements of most stochastic, dynamic systems.
Page 33, line 10 from bottom: "we would be willing to pay up to $300" should be "we would be willing to pay up to $370"
Overview of the chapter:
This is a streamlined introduction to the classic theory of discrete state, discrete action Markov decision processes, where we assume that we have access to a one-step transition matrix. The theory is quite elegant, but the algorithms are extremely limited in terms of their range of applications. However, the fundamental algorithms (value iteration, policy iteration) underlie most of approximate dynamic programming. More advanced students will enjoy the convergence proofs in the appendix.
The example states "Let R_t be the current inventory." While I often use "R_t" as a state variable for resources, in this example it should read "Let S_t be the current inventory."
Overview of the chapter:
This chapter, which was almost completely rewritten for the second edition, could be called "Introduction to reinforcement learning." By the end of this chapter, you will have a good working knowledge of the most important ideas in approximate dynamic programming, presented from the perspective of the reinforcement learning community. The chapter starts with Bellman's equation, and discusses the computational challenges that arise because of 1) large state spaces, 2) computing expectations (or the one-step transition matrix), and 3) large action spaces (especially if the decision is a vector). These are known as the three curses of dimensionality.
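For reference, a generic statement of the discrete-state Bellman equation around which this discussion is organized (not quoted from the book) is

$$ V(s) = \max_{a \in \mathcal{A}} \Big( C(s,a) + \gamma \sum_{s' \in \mathcal{S}} \mathbb{P}(s' \mid s, a)\, V(s') \Big), $$

where the three curses appear as the set of states s over which V must be computed, the expectation (the sum over s'), and the maximum over actions a.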
Q-learning, SARSA and other learning issues
The chapter then introduces Q-learning and SARSA, the concepts of on-policy and off-policy learning, and the exploration-exploitation problem. It also introduces the algorithmic strategy known as "real-time dynamic programming," a specialized algorithm that overcomes the exploration problem but assumes that the expectation can be computed.
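A minimal sketch of the two updates described here, assuming a lookup-table representation (the dictionary Q, the callable actions(s), and the function names are illustrative, not the book's code):

```python
def q_learning_step(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy Q-learning: bootstrap from the best action in the next state."""
    target = r + gamma * max(Q.get((s_next, a2), 0.0) for a2 in actions(s_next))
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target

def sarsa_step(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy SARSA: bootstrap from the action the policy actually takes next."""
    target = r + gamma * Q.get((s_next, a_next), 0.0)
    Q[(s, a)] = (1 - alpha) * Q.get((s, a), 0.0) + alpha * target
```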
Approximate value iteration
Value iteration is presented next using a Monte Carlo-based method for approximating the expectation. This sets up the most fundamental ADP algorithm, which is also the most intuitive and easiest to implement. However, this algorithm will not generally work in practice (a point the presentation does not make very clearly). For example, the algorithm is described as an on-policy algorithm, which can easily become stuck in a subset of states (we deal with this point in much greater detail later in the book).
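A rough sketch of this basic on-policy algorithm, under the simplifying assumptions that values are stored in a lookup table and that a single Monte Carlo sample stands in for the expectation (the function names, horizon, and stepsize below are illustrative):

```python
def approximate_value_iteration(s0, actions, C, sample_next_state, gamma,
                                n_iterations=100, T=50, alpha=0.1):
    """Forward-pass approximate value iteration with a lookup-table approximation V."""
    V = {}
    for n in range(n_iterations):
        s = s0
        for t in range(T):
            # one sampled next state replaces the expectation in Bellman's equation
            values = {a: C(s, a) + gamma * V.get(sample_next_state(s, a), 0.0)
                      for a in actions(s)}
            a_star = max(values, key=values.get)
            V[s] = (1 - alpha) * V.get(s, 0.0) + alpha * values[a_star]  # smoothing update
            s = sample_next_state(s, a_star)  # follow the greedy action (on-policy)
    return V
```

Because the trajectory follows the greedy action, this sketch inherits the weakness noted above: it can get stuck in a subset of states.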
The post-decision state
We then provide an expanded discussion of the post-decision state, with a large number of examples. The idea of the post-decision state has been used by other authors, sometimes under different names ("after state", "end of period state"), but my sense is that this idea has been largely overlooked by the RL/ADP literature. The post-decision state is the state measured just after you have made a decision, but before any new information has arrived. For some problems, the post-decision state is simpler (sometimes dramatically so) than the pre-decision state, which can simplify the process of approximating the value function. It is almost always simpler than estimating a Q-factor. Most importantly, it eliminates the need to compute (or approximate) the expectation within the max/min operator. This makes it much easier to run an ADP/RL algorithm.
The RL community often focuses on "model free" dynamic programming. Q-learning only requires that we observe the next state, so we need neither a transition function nor the ability to compute an expectation when estimating the value of a state-action pair. A state-action pair is actually a kind of post-decision state variable, but it can be much clumsier than a post-decision state, although this depends on the problem. In the book, I present about 10 examples of the post-decision state.
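In notation close to the book's, with S^{M,x} denoting the transition to the post-decision state and \bar{V}^x a post-decision value function approximation, the idea can be summarized as

$$ S_t^x = S^{M,x}(S_t, x_t), \qquad S_{t+1} = S^{M,W}(S_t^x, W_{t+1}), $$
$$ X^{\pi}(S_t) = \arg\max_{x} \Big( C(S_t, x) + \bar{V}_t^x\big(S^{M,x}(S_t, x)\big) \Big). $$

Note that no expectation appears inside the max; the expectation is pushed into the estimation of \bar{V}^x.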
Overview of the chapter:
40 pages on how to model a dynamic program. This chapter covers subtle issues such as the modeling of time. When modeling a stochastic system, it is especially important to know what is, and is not, random at a point in time t.
It introduces the five core components of any dynamic system: the state variable, the decision variable, exogenous information, the transition function, and the objective function.
This chapter also covers topics such as partially observable states and model-based versus model-free dynamic programming.
I routinely find that the biggest weakness of application papers in stochastic optimization is a lack of precision in the modeling. This chapter offers a simple template for modeling problems that provides a level of clarity and precision that is missing from most of the papers that I read.
June 29, 2011 - The objective function
The book focuses on an objective function that minimizes expected costs or maximizes expected rewards. This approach ignores risk, which arises in many applications. Of course, you can handle risk by simply adding a risk measure (such as variance) to the objective function, but a more interesting approach introduces an explicit risk constraint.
A separate strategy focuses on worst-case performance. This perspective has been receiving attention under the name "robust optimization," although many authors claim that any solution to a stochastic optimization problem is "robust" relative to a solution that optimizes a deterministic approximation. The growing field of robust optimization assumes that we can place bounds on the outcomes of random variables. Instead of taking expectations, you use the worst random outcome that might happen.
Overview of the chapter:
I like to characterize the field of stochastic optimization as a jungle of methods, often presented in the academic research literature with an impenetrable style (admissible policies? filtrations? nonanticipativity constraints?). This chapter breaks down this jungle by describing four fundamental types of policies: myopic policies, lookahead policies, policy function approximations, and policies based on value function approximations.
Policy function approximations and value function approximations both require approximating a function (!!). There are three fundamental ways of approximating a function: lookup tables, parametric models, and nonparametric models.
Parametric models (for actions or states) introduce the art of coming up with the functional form of the approximation; the science involves estimating the parameters.
Although there are four fundamental classes of policies, it is possible to create hybrid policies by mixing and matching different strategies. For example, you can build a lookahead policy that searches several steps into the future and then uses a value function approximation to limit the depth of the search. Myopic policies can be tweaked by adding tunable parameters, which can be viewed as a hybrid of a myopic policy and a policy function approximation.
July 31, 2011 - Cost function approximations
As I wrote up my four classes of policies in the summer of 2010, there was one class of policies that did not seem to fit nicely into any of these four categories. It occurred to me that I had overlooked an important class that I would group with myopic policies, which I am going to call cost function approximation (a name that nicely parallels policy function approximation and value function approximation). One way to make a myopic policy work better is to modify the costs in a heuristic way. For example, imagine that you are assigning people (such as surgeons) or machines to tasks. Our cost function might prioritize putting the best person or machine on a task, but this might result in some lower priority tasks being delayed, which creates problems in the future. For this reason, we might heuristically add a bonus for covering tasks that have been delayed. The bonus is a tunable parameter.
This strategy can be combined with lookahead policies and policies based on value function approximations (both of which require a cost function to minimize), but in its purest form it would be used with a myopic policy, where we minimize a one-period cost. In principle, cost function approximations can be lookup tables, parametric or nonparametric, although I have only used parametric approximations in the form of adding tunable costs. We can then optimize over a class of cost function approximations by finding the tunable cost (or bonus) that works best over time.
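A small sketch of this idea for an assignment problem (the function and parameter names are hypothetical, and scipy's assignment solver simply stands in for whatever optimizer would be used for the myopic problem):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def cfa_assignment_policy(cost, delay, theta):
    """Myopic assignment policy with a cost function approximation.
    cost[i, j] is the cost of assigning resource i to task j, delay[j] is how
    long task j has been waiting, and theta is a tunable bonus for covering
    delayed tasks (tuned using the policy search methods of chapter 7)."""
    modified_cost = cost - theta * np.asarray(delay)[np.newaxis, :]
    rows, cols = linear_sum_assignment(modified_cost)  # solve the one-period problem
    return list(zip(rows, cols))
```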
Overview of the chapter:
In this chapter, we assume we have some sort of parameterized policy. This might be a myopic policy with tunable parameters, a policy function approximation, or even a value function approximation with tunable parameters. In this chapter, we present methods for tuning these parameters using concepts from the field of stochastic search.
The chapter starts with classical search methods based on stochastic gradients, assuming that derivatives can be computed. In the appendix, proofs are presented for those interested in seeing the theory behind these techniques.
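A bare-bones sketch of such a stochastic gradient procedure (the harmonic stepsize and the names are illustrative; grad_F(theta) is assumed to return a sampled gradient of the objective):

```python
def stochastic_gradient_ascent(grad_F, theta, n_iterations=1000, a=10.0):
    """Tune policy parameters theta by following noisy gradients with a
    declining (harmonic) stepsize a/(a+n)."""
    for n in range(1, n_iterations + 1):
        alpha = a / (a + n)
        theta = theta + alpha * grad_F(theta)
    return theta
```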
The chapter then introduces search methods assuming a finite set of policies, also known as the ranking and selection problem. Frequentist and Bayesian approaches are presented, including Bayesian updating with correlated beliefs (a powerful idea we return to). Various heuristic search algorithms are presented, including epsilon-greedy, Boltzmann exploration and interval estimation.
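Two of these heuristics, sketched over a finite set of alternatives with current estimates mu_hat (illustrative code, not the book's):

```python
import numpy as np

def epsilon_greedy(mu_hat, epsilon, rng=None):
    """With probability epsilon explore at random; otherwise pick the best estimate."""
    rng = rng or np.random.default_rng()
    if rng.random() < epsilon:
        return int(rng.integers(len(mu_hat)))
    return int(np.argmax(mu_hat))

def boltzmann(mu_hat, temperature, rng=None):
    """Sample alternative x with probability proportional to exp(mu_hat[x]/temperature)."""
    rng = rng or np.random.default_rng()
    z = np.exp((np.asarray(mu_hat) - np.max(mu_hat)) / temperature)
    return int(rng.choice(len(mu_hat), p=z / z.sum()))
```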
We then introduce the knowledge gradient, a general search technique that chooses the set of parameters to evaluate that produces the greatest marginal value of information. The knowledge gradient is given for discrete alternatives with independent and correlated beliefs, as well as when a parametric belief model is used (in particular, linear regression). The knowledge gradient is a Bayesian method for learning the best set of parameters. This chapter provides only a brief introduction to this powerful method. For more information, see our optimal learning website.
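For discrete alternatives with independent normal beliefs, the knowledge gradient has a well-known closed form (stated here from the optimal learning literature, not quoted from the book): if $\tilde\sigma_x^n$ is the reduction in the standard deviation of the belief about alternative $x$ from one more measurement, then

$$ \nu_x^{KG,n} = \tilde\sigma_x^n \, f(\zeta_x^n), \qquad \zeta_x^n = -\frac{\bigl|\mu_x^n - \max_{x' \neq x}\mu_{x'}^n\bigr|}{\tilde\sigma_x^n}, \qquad f(\zeta) = \zeta\,\Phi(\zeta) + \phi(\zeta), $$

where $\Phi$ and $\phi$ are the standard normal cdf and density, and we measure the alternative with the largest $\nu_x^{KG,n}$.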
We next present some important ideas from the literature known as "simulation optimization." These methods were derived in the setting of finding a set of parameters (typically a finite set) to produce the best results from a Monte Carlo simulation. We illustrate two important ideas - indifference zone selection, and the optimal computing budget allocation, which seeks to allocate a finite budget for computing time among a small set of alternatives. These methods are all frequentist.
p. 285 - The conclusion that \beta^n goes to zero requires that we modify the assumption in equation 7.69 (p. 281) so that the stepsizes alpha_n are strictly greater than 0 (rather than greater than or equal to zero).
Overview of the chapter:
Chapter 8 is the first of a three-chapter sequence to develop the important ideas of finding policies based on approximating value functions. This chapter can be viewed as a review of statistical techniques that have proved popular in the ADP/RL communities for approximating value functions.
Major techniques include lookup tables (with aggregation), parametric models based on linear regression with basis functions, and nonparametric methods.
The curse of dimensionality
We provide a section on the curse of dimensionality. ADP is sometimes presented as a method for overcoming the curse of dimensionality, which we believe is a little disingenuous. Approximating an N-dimensional function in a way that captures interactions between all N dimensions is fundamentally difficult for values of N larger than about 5. Approximating high-dimensional functions typically requires using separable approximations or sequences of low-dimensional approximations. ADP/RL has not solved this problem.
The classical discrete state representation (also known as a flat representation) does not use separable (or low-dimensional) approximations, which is why classical discrete state models do not scale to interesting problems.
Overview of the chapter:
There are many papers that focus on the problem of estimation (sometimes called prediction), which means estimating the value of being in a state while following a fixed policy. This entire chapter focuses on this problem. To further emphasize that we are fixing the policy, any quantity that depends on a policy is indexed by pi.
Our presentation starts with an overview of different methods for taking samples of the value of being in a state. This includes sampling the value of a policy over finite and infinite horizons, and bootstrapping, known widely in the reinforcement learning community as "temporal differencing".
We then introduce stochastic approximation methods for updating the parameters of a function, followed by recursive least squares for linear regression models, including updating formulas for stationary, nonstationary and autoregressive models.
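A compact sketch of the recursive least squares update (with a forgetting factor lam for nonstationary data; the names are illustrative):

```python
import numpy as np

def rls_update(theta, B, x, y, lam=1.0):
    """One recursive least squares step for a linear model y ~ theta @ x.
    B is the inverse of the (discounted) regression matrix; lam < 1 discounts old data."""
    Bx = B @ x
    gamma = lam + x @ Bx
    theta = theta + (Bx / gamma) * (y - theta @ x)   # correct by the prediction error
    B = (B - np.outer(Bx, Bx) / gamma) / lam         # Sherman-Morrison update of the inverse
    return theta, B
```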
We then present an in-depth discussion of Bellman's equation using linear models. We present a matrix-based derivation, a simulation-based algorithm, and presentations of least squares temporal differencing (LSTD) and least squares policy evaluation (LSPE).
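For a fixed policy with basis functions $\phi$, discount factor $\gamma$, and a sampled trajectory $(s_t, r_{t+1}, s_{t+1})$, one common statement of the LSTD(0) estimator (taken from the temporal differencing literature, not quoted from the book) is

$$ \theta^{LSTD} = \Big[ \sum_t \phi(s_t)\big(\phi(s_t) - \gamma\,\phi(s_{t+1})\big)^{\top} \Big]^{-1} \sum_t \phi(s_t)\, r_{t+1}. $$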
We then provide an analysis of different algorithms (recursive least squares, TD(0), LSPE, LSTD) for a simplified dynamic program with a single state and single action. This simple problem allows us to write each of these estimators in closed form, which lets us understand the type of stepsize rule that each requires.
The chapter then presents some exciting new material on gradient-based methods for approximate value iteration. Classical methods for estimating the parameters of a value function approximation have computed gradients of a weighted sum of squared Bellman errors. While popular, this approach has known limitations, including divergence for some applications. New research has proposed using a different objective function called the "mean squared *projected* Bellman error", or MSPBE for short. This method leads to a provably convergent algorithm, overcoming the divergent behavior of previous algorithms.
The chapter closes with a presentation of recent research on least squares temporal differencing using kernel regression.
Overview of the chapter:
Learning a value function approximation while sampling observations from a policy that depends on the value function approximation is ... well, hard!
The issues that arise when learning while optimizing depend heavily on how we sample observations of the value of being in a state, and on how we are trying to approximate the value function.
There are two fundamental methods for sampling the value of being in a state. The first uses bootstrapping, where the value of being in a state s depends on the best action we can take and our current approximation of the value of the state that the action takes us to. This is known as TD learning and is used in approximate value iteration.
The second method simulates a policy (directly or indirectly) and uses the resulting sequence of costs/rewards to get an observation of the value of being in a state. This simulation can be done by literally walking forward through time (perhaps to the end of a horizon), or by inferring the infinite horizon contribution from a single cost. This strategy is best recognized under the name of approximate policy iteration, but methods such as least squares temporal differencing accomplish the same thing.
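A small sketch contrasting the two ways of observing the value of a state under a fixed policy (lookup-table V; the names are illustrative):

```python
def td0_observation(V, r, s_next, gamma):
    """Bootstrapped observation of the value of the current state: r + gamma * V(s_next)."""
    return r + gamma * V.get(s_next, 0.0)

def monte_carlo_observation(rewards, gamma):
    """Simulate the policy forward and sum discounted rewards to observe the value
    of the state where the simulation started."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))
```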
Of particular importance is how we are approximating the value function. Recall that there are three fundamental strategies: lookup tables, parametric models, and nonparametric models. If we use a lookup table, we benefit from the fact that if we sample states and actions often enough, we will eventually learn the best value of being in a state. The key here is that we need an exploration strategy to ensure this, which means we are using off-policy learning.
Of greatest interest in the ADP community is the use of linear regression (the easiest form of parametric model). Sadly, even for a fixed policy, off-policy learning will not, in general, produce the correct value function approximation; we have to use on-policy learning. At the same time, we do not have the same need for exploration that arises with lookup tables: instead of visiting all states infinitely often, we only need to visit enough different states to fit the parameters. Very recent research using the mean squared projected Bellman error introduces an important correction term that makes it possible to use off-policy learning. However, there are some applications where on-policy learning (where we follow a single trajectory produced by a policy that depends on the value function approximation) is the most practical.
Approximate policy iteration has attracted considerable attention because it enables trajectory following in an inner iteration where we fix the value function approximation (to fix the policy) and then simulate the policy. This has to be done many times to get a good approximation of the value function for this policy, which is then used to update the policy. Convergence proofs are available, but these require assumptions either on the completeness of the basis functions or the behavior of the policy in terms of exploring states.
There is growing interest in the use of nonparametric models to avoid the "art" of choosing basis functions (also known as features). Nonparametric methods use local observations of a function to approximate the function in a region. This means that we lose the power of parametric methods where observations in one region of the function allow us to update our approximation of the entire function. This is both a weakness and a strength, because when one observation updates the entire function, it introduces instabilities. The flip side is that we are now going to need exploration policies to ensure that we visit different portions of the function.
Overview of the chapter:
Fundamental to approximate dynamic programming is the need for recursive estimation, where a current estimate is combined with a new observation to produce an updated estimate. By far the most popular method for doing this uses a stochastic approximation algorithm, where the old estimate is multiplied by (1-alpha_{n-1}) and added to alpha_{n-1} times the new observation. alpha_{n-1} is known as a stepsize (reflecting its derivation from stochastic approximation theory), a smoothing factor, or a learning rate.
A stepsize of 1/n (which has the effect of averaging the observations) satisfies the properties needed for convergence proofs, but has long been recognized to decline too quickly (although under certain conditions it can be the best possible stepsize). A harmonic stepsize of a/(a+n) slows the rate of decline as "a" is increased.
The reinforcement learning community most often uses a constant stepsize in practice, primarily because it avoids the practical problem of a declining stepsize that declines too quickly. This choice introduces a tunable parameter, and it also forces a single stepsize on the entire problem: states (or regions of the state space) with few observations benefit from a larger stepsize, while frequently visited states benefit from a smaller stepsize, which produces more accurate estimates.
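The three stepsize families mentioned above, in a small sketch of the basic smoothing recursion (illustrative code):

```python
def one_over_n(n):                  # averaging; declines quickly
    return 1.0 / n

def harmonic(n, a=10.0):            # a/(a+n); larger a slows the decline
    return a / (a + n)

def constant(n, alpha=0.1):         # popular in RL; a tunable parameter
    return alpha

def smooth(observations, stepsize):
    """Recursive estimate: new = (1 - alpha) * old + alpha * observation."""
    estimate = 0.0
    for n, obs in enumerate(observations, start=1):
        alpha = stepsize(n)
        estimate = (1 - alpha) * estimate + alpha * obs
    return estimate
```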
This chapter reviews deterministic stepsize rules, heuristic stochastic (adaptive) stepsize rules, and several stepsize rules that satisfy different optimality criteria for rate of convergence. We review three of these optimal stepsize rules:
Overview of the chapter:
When I wrote my "Exploration vs. Exploitation" chapter in the first edition, I realized how little I knew about this topic, and decided to make it a major area of research. This research is evolving under the name "Optimal Learning" (see http://optimallearning.princeton.edu). Some of this work is evident in chapter 7 on policy search, where we use it to help search for the best values of the parameters that determine a policy. In this setting, learning is a dynamic programming problem where the state is our state of belief about the parameters. After each measurement, I can test any value of the parameters I would like.
In this chapter, we tackle the much more difficult problem of learning in the presence of a physical state (which characterizes all of the dynamic programming problems in this book). The chapter starts with a basic introduction to learning (frequentist and Bayesian), and reviews some of the popular heuristic policies (although this review is incomplete - more on this in additional updates to this website).
We then review the classic theory of bandit problems and Gittins indices, which is one of the earliest contributions to active learning. We also review approximations of Gittins indices, and the technique of upper confidence bounding.
We then review the knowledge gradient policy, first introduced in chapter 7, briefly covering the knowledge gradient for correlated beliefs, as well as on-line problems (multiarmed bandits) and off-line problems (ranking and selection). Often overlooked in the ADP literature is that algorithms for training policies are fundamentally online, because they consider the cost/reward being earned, rather than purely the value of information.
We then present some new material for learning in the presence of a physical state. We review an elegant heuristic called the "local bandit approximation," which is an adaptation of Gittins indices to problems with a physical state. We then show how the knowledge gradient can be used, where it results in adding a "value of information" term to the objective when making a decision. We exploit the ability of the knowledge gradient to visit one state while learning about another state (either through correlated beliefs for discrete states, or through parametric models). The supporting literature behind this material is quite young, and it should be viewed as emerging research. Early numerical work is promising, but the algorithm is considerably more complicated than the simple heuristics that are used in practice. Not known at this time is whether the computational cost of the knowledge gradient calculation is offset by the benefits of the algorithm.
Overview of the chapter:
There is a large problem class that is best described as "resource allocation," where we are managing blood, money, people, water, energy or food. We have to decide not just what to do with a resource, but usually also how much. These problems are typically convex (concave if maximizing) in the "how much" dimension, and we can exploit this property.
In this chapter, we focus on different methods for approximating value functions where we can take advantage of convexity. Approximation strategies might be linear in the resource state, piecewise linear separable, or quadratic, or they may use a powerful technique from mathematical programming known as Benders cuts. The use of multidimensional Benders cuts is also known as "stochastic dual dynamic programming" (SDDP), which is well known in the stochastic programming community and can be viewed as a form of approximate dynamic programming.
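A minimal sketch of a piecewise linear, separable value function approximation of the kind discussed here (the names and the integer-resource assumption are illustrative):

```python
import numpy as np

def pwl_separable_value(slopes, R):
    """V(R) ~= sum_i V_i(R_i), where each V_i is concave and stored as a vector
    of declining marginal values: slopes[i][k] is the value of the (k+1)-st unit
    of resource type i, and R[i] is the (integer) amount of resource i on hand."""
    return sum(float(np.sum(np.asarray(slopes[i])[: int(R[i])]))
               for i in range(len(R)))
```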
We have considerable experience with stochastic resource allocation problems in our work at CASTLE Lab. Most of our applications use approximations that are piecewise linear and separable in the resource variable. This method has proven to be very robust and scalable. In our comparisons against Benders cuts, we find that Benders cuts work well up to around 10 or 20 dimensions. Above this, the method seems to become quite slow. Piecewise linear separable approximations seem to work well for problems with hundreds or thousands of dimensions. For our largest problems (millions of dimensions), the number of resources of a particular type becomes so small that we end up using approximations that are linear in the resource state.
Overview of the chapter:
Here we describe a series of resource allocation problems of increasing complexity. We start with a simple inventory problem (the resource state is scalar), and then present a blood management problem that nicely illustrates how we use the post-decision state to solve a multidimensional resource allocation problem. This basic strategy generalizes to much larger problems. We also have an illustration of a financial portfolio optimization problem. Again, this is done primarily to illustrate an approximation strategy.
Throughout this chapter, we focus on optimization problems with vector-valued decisions, so we use "x" throughout as our decision variable.
Section 14.4 pauses to present a general notational strategy for a fairly broad class of resource allocation problems. Our hope is that this notational style will prove useful for some modelers.
Sections 14.5 and 14.6 review two large, industrial applications arising in the management of boxcars for a major railroad and the planning of drivers (a complex resource) for a truckload motor carrier. These projects primarily demonstrate that these methods will work in practice.
Overview of the chapter:
Anyone who works with real applications learns that there is a real art to solving this very broad class of problems. I have made an effort to summarize some of the insights that I have acquired from the applications that I have worked on.
Chapter 15 is only a minor revision from the first edition, where my focus was primarily on using value function approximations to solve problems in the contextual domain of resource allocation. I offer some general insights into whether value function approximations are a good strategy for your problem, along with steps for debugging an algorithm. I provide a random list of thoughts on discount factors, getting through the early iterations of an ADP algorithm (when value functions and policies are at their worst), evaluating a policy, and dealing with the challenge of when to stop an algorithm.
I close with a section called "If it works, patent it!" Although I would actually discourage the community from patents (which simply discourages innovation in my view), the point is that getting an ADP algorithm to really work is a real breakthrough. There is a lot of theory and toy experiments in labs, but the transition to a real field implementation is quite difficult.
Good luck!