Approximate Dynamic Programming

Much of our work falls in the intersection of stochastic programming and dynamic programming. The dynamic programming literature primarily deals with problems with low-dimensional state and action spaces, which allow the use of discrete dynamic programming techniques. The stochastic programming literature, on the other hand, deals with the same sorts of high-dimensional vectors that are found in deterministic math programming. However, the stochastic programming community generally does not exploit state variables, and does not use the concepts and vocabulary of dynamic programming.

Our contributions to the area of approximate dynamic programming can be grouped into four broad categories: general contributions; transportation and logistics, which we have broadened into general resource allocation; discrete routing and scheduling problems; and batch service problems. A series of short introductory articles is also available.

General contributions:

Warren B. Powell. “Clearing the Jungle of Stochastic Optimization.” INFORMS Tutorials in Operations Research: Bridging Data and Decisions, pp. 109-137, November, 2014, http://dx.doi.org/10.1287/educ.2014.0128

This invited tutorial unifies different communities working on sequential decision problems. First, it provides a simple, five-part canonical form for modeling stochastic dynamic programs (drawing on established notation from the controls community), with a thorough discussion of state variables. It then summarizes four fundamental classes of policies: policy function approximations (PFAs), policies based on cost function approximations (CFAs), policies based on value function approximations (VFAs), and lookahead policies. There is a detailed discussion of stochastic lookahead policies (familiar to the stochastic programming community). It closes with a summary of results using approximate value functions in an energy storage problem.

(Click here to download paper)
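
The four policy classes are easiest to see side by side. Below is a minimal Python sketch, written around a hypothetical one-dimensional energy storage problem (the problem, parameter values, and function names are illustrative assumptions, not taken from the tutorial), contrasting the form of a PFA, a CFA, a VFA-based policy, and a deterministic lookahead.

```python
import numpy as np

# Toy single-storage problem: state = (energy in storage, current price).
# Buying x > 0 units costs price * x; selling x < 0 earns revenue. The four
# functions below only contrast the *form* of each policy class.

CAPACITY = 100.0

def policy_pfa(state, fill_to=60.0):
    """Policy function approximation: a simple parameterized rule
    ("buy up to fill_to when below it, otherwise do nothing")."""
    storage, price = state
    return max(0.0, fill_to - storage)

def policy_cfa(state, price_forecast, theta=1.2):
    """Cost function approximation: minimize a *modified* one-period cost whose
    parameter theta is tuned offline so the myopic rule behaves well over time."""
    storage, price = state
    candidates = np.linspace(-storage, CAPACITY - storage, 41)
    modified_cost = price * candidates - theta * price_forecast * candidates
    return float(candidates[np.argmin(modified_cost)])

def policy_vfa(state, value_of_storage):
    """Policy based on a value function approximation: one-period cost plus an
    approximate value of the post-decision storage level (value_of_storage is
    any callable approximation, e.g. a piecewise linear function)."""
    storage, price = state
    candidates = np.linspace(-storage, CAPACITY - storage, 41)
    objective = [price * x - value_of_storage(storage + x) for x in candidates]
    return float(candidates[int(np.argmin(objective))])

def policy_lookahead(state, price_forecast):
    """Deterministic lookahead: optimize today's and tomorrow's decisions against
    a point forecast, then implement only today's decision."""
    storage, price = state
    best_x0, best_cost = 0.0, float("inf")
    for x0 in np.linspace(-storage, CAPACITY - storage, 41):
        s1 = storage + x0
        for x1 in np.linspace(-s1, CAPACITY - s1, 41):
            total = price * x0 + price_forecast * x1
            if total < best_cost:
                best_x0, best_cost = float(x0), total
    return best_x0
```

A stochastic lookahead would replace the single point forecast with a sampled scenario tree, which is where the connection to stochastic programming appears.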

Daniel Jiang, Thuy Pham, Warren B. Powell, Daniel Salas, Warren Scott, “A Comparison of Approximate Dynamic Programming Techniques on Benchmark Energy Storage Problems: Does Anything Work?,” IEEE Symposium Series on Computational Intelligence, Workshop on Approximate Dynamic Programming and Reinforcement Learning, Orlando, FL, December, 2014.

This paper uses two variations on energy storage problems to investigate a variety of algorithmic strategies from the ADP/RL literature. All of these methods are tested on benchmark problems that are solved optimally, so that we get an accurate estimate of the quality of the policies being produced. Somewhat surprisingly, generic machine learning algorithms for approximating value functions did not work particularly well. What did work well is best described as “lookup table with structure.” The structure we exploit is convexity and monotonicity. Test datasets are available at http://www.castle.princeton.edu/datasets.htm.

(Click here to download paper)
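
As a rough illustration of "lookup table with structure," here is a small Python sketch that enforces monotonicity after each update; the discretization, stepsize, and monotone direction are illustrative assumptions rather than the paper's benchmark setup (the paper also exploits convexity).

```python
import numpy as np

# Lookup-table value function over a discretized storage level, with a
# monotonicity projection applied after every stochastic update.
# Assumes the true value function is nondecreasing in the storage level.

levels = np.linspace(0.0, 100.0, 101)   # discretized storage levels
v_hat = np.zeros_like(levels)           # current value estimates

def update_monotone(v_hat, idx, observed_value, stepsize=0.1):
    """Smooth the new observation into entry idx, then project the table
    back onto the set of nondecreasing functions."""
    v_hat = v_hat.copy()
    v_hat[idx] = (1 - stepsize) * v_hat[idx] + stepsize * observed_value
    # Projection: push violating entries up to the right of idx, down to the left.
    for j in range(idx + 1, len(v_hat)):
        v_hat[j] = max(v_hat[j], v_hat[idx])
    for j in range(idx - 1, -1, -1):
        v_hat[j] = min(v_hat[j], v_hat[idx])
    return v_hat

# Example: a noisy observation of the value of having 40 units in storage.
v_hat = update_monotone(v_hat, idx=40, observed_value=12.5)
```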

J. Nascimento, W. B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for Concave, Scalar Storage Problems with Vector-Valued Controls,” IEEE Transactions on Automatic Control, Vol. 58, No. 12, pp. 2995-3010. http://dx.doi.org/10.1109/TAC.2013.2272973 (2013).

This paper proves convergence for an ADP algorithm using approximate value iteration (TD(0)) for problems that feature vector-valued decisions (e.g., allocating energy over a grid) linked by a scalar storage system, such as a water reservoir. The problem arises in settings where resources are distributed from a central storage facility. The algorithm is well suited to continuous problems, which would otherwise require that the function capturing the value of future inventory be finely discretized, since the algorithm adaptively generates breakpoints for a piecewise linear approximation. The strategy does not require exploration, which is common in reinforcement learning.

(Click here to download paper)

W. B. Powell, J. Ma, “A Review of Stochastic Algorithms with Continuous Value Function Approximation and Some New Approximate Policy Iteration Algorithms for Multi-Dimensional Continuous Applications,” Journal of Control Theory and Applications, Vol. 9, No. 3, pp. 336-352, 2011.

We review the literature on approximate dynamic programming, with the goal of better understanding the theory behind practical algorithms for solving dynamic programs with continuous and vector-valued states and actions, and complex information processes. We build on the literature that has addressed the well-known problem of multidimensional (and possibly continuous) states, and the extensive literature on model-free dynamic programming which also assumes that the expectation in Bellman’s equation cannot be computed. However, we point out complications that arise when the actions/controls are vector-valued and possibly continuous. We then describe some recent research by the authors on approximate policy iteration algorithms that offer convergence guarantees (with technical assumptions) for both parametric and nonparametric architectures for the value function.

(click here to download paper)

Ryzhov, I. and W. B. Powell, “Bayesian Active Learning with Basis Functions,” IEEE Workshop on Adaptive Dynamic Programming and Reinforcement Learning, Paris, April, 2011.

A common technique for dealing with the curse of dimensionality in approximate dynamic programming is to use a parametric value function approximation, where the value of being in a state is assumed to be a linear combination of basis functions. We propose a Bayesian strategy for resolving the exploration/exploitation dilemma in this setting. Our approach is based on the knowledge gradient concept from the optimal learning literature, which has been recently adapted for approximate dynamic programming with lookup-table approximations. The new method performs well in numerical experiments conducted on an energy storage problem.

(click here to download paper)

W.B. Powell, Approximate Dynamic Programming, John Wiley and Sons, 2007.

This is the first book to bridge the growing field of approximate dynamic programming with operations research. Dynamic programming has often been dismissed because it suffers from “the curse of dimensionality.” In fact, there are three curses of dimensionality when you deal with the high-dimensional problems that typically arise in operations research (the state space, the outcome space and the action space). This book shows how we can estimate value function approximations around the post-decision state variable to produce techniques that allow us to solve dynamic programs which exhibit states with millions of dimensions (approximately).

The book is aimed at an advanced undergraduate/masters level audience with a good course in probability and statistics, and linear programming (for some applications). For the advanced Ph.D., there is an introduction to fundamental proof techniques in “why does it work” sections. The book emphasizes solving real-world problems, and as a result there is considerable emphasis on proper modeling. The book includes dozens of algorithms written at a level that can be directly translated to code. The material in this book is motivated by numerous industrial applications undertaken at CASTLE Lab, as well as a number of undergraduate senior theses.
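
To make the post-decision state idea concrete, here is a minimal sketch of a forward-pass ADP iteration on a toy energy storage problem; the problem, numbers, and stepsize are illustrative assumptions, and the code is not taken from the book.

```python
import random

# A tiny energy-storage example illustrating value function approximation
# around the post-decision state (the storage level after the decision, before
# the next price is revealed). All numbers are illustrative.

T, CAP = 24, 10
V_bar = [dict() for _ in range(T)]       # V_bar[t][storage level] ~ value estimate

def run_iteration(V_bar, stepsize=0.1):
    storage, price, prev_post = 5, 30.0, None
    for t in range(T):
        # Decision x: buy (x > 0) or sell (x < 0), scored by the one-period
        # contribution plus the approximate value of the post-decision storage.
        def score(x):
            return -price * x + V_bar[t].get(storage + x, 0.0)
        x = max(range(-storage, CAP - storage + 1), key=score)
        v_hat = score(x)                  # sampled value of the current state
        if prev_post is not None:         # smooth into the previous post-decision state
            old = V_bar[t - 1].get(prev_post, 0.0)
            V_bar[t - 1][prev_post] = (1 - stepsize) * old + stepsize * v_hat
        prev_post = storage + x
        storage = prev_post
        price = max(5.0, price + random.gauss(0.0, 5.0))   # exogenous price process
    return V_bar

for _ in range(200):
    V_bar = run_iteration(V_bar)
```

The key point is that the decision problem inside the loop is deterministic: the expectation never has to be computed, because the value function is indexed by the state *after* the decision but *before* the new information arrives.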

Powell, W. B., “Approximate Dynamic Programming I: Modeling,” Encyclopedia of Operations Research and Management Science, John Wiley and Sons, (to appear).

Powell, W. B., “Approximate Dynamic Programming II: Algorithms,” Encyclopedia of Operations Research and Management Science, John Wiley and Sons, (to appear).

These two short chapters provide yet another brief introduction to the modeling and algorithmic framework of ADP. The first chapter actually has nothing to do with ADP (it grew out of the second chapter). Instead, it describes the five fundamental components of any stochastic, dynamic system. There is also a section that discusses "policies", a term that is often used by specific subcommunities in a narrow way; I describe nine specific examples of policies. I think this helps put ADP in the broader context of stochastic optimization.

The second chapter provides a brief introduction to algorithms for approximate dynamic programming. In the tight constraints of these chapters for Wiley’s Encyclopedia, it is not possible to do a topic like this justice in 20 pages, but if you need a quick peek into ADP, this is one sample.

(click here to download: ADP – I: Modeling)

(click here to download: ADP – II: Algorithms)

Ryzhov, I. O., W. B. Powell, “Approximate Dynamic Programming with Correlated Bayesian Beliefs,” Forty-Eighth Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, Sept. 29-Oct. 1, 2010.

The exploration-exploitation problem in dynamic programming is well-known, and yet most algorithms resort to heuristic exploration policies such as epsilon-greedy. We use a Bayesian model of the value of being in each state with correlated beliefs, which reflects the common fact that visiting one state teaches us something about visiting other states. We use the knowledge gradient algorithm with correlated beliefs to capture the value of the information gained by visiting a state. Using both a simple newsvendor problem and a more complex problem of making wind commitments in the presence of stochastic prices, we show that this method produces significantly better results than epsilon-greedy for both Bayesian and non-Bayesian beliefs. These are shown for both offline and online implementations.

(click here to download paper)
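
The core of the correlated-beliefs machinery is the multivariate normal updating step, under which observing the value of one state shifts the estimates of every other state through the covariance matrix. Here is a minimal sketch under assumed (illustrative) prior means, covariance, and measurement noise; the knowledge gradient policy then scores each candidate state by the expected improvement such an update would produce.

```python
import numpy as np

# Correlated Bayesian beliefs about the value of being in each of several states.
# Observing the value of one state updates the estimates for all of them through
# the covariance matrix. All numbers here are illustrative.

n = 5
mu = np.zeros(n)                                   # prior means
# Prior covariance: nearby states are believed to have similar values.
Sigma = np.array([[np.exp(-0.5 * abs(i - j)) for j in range(n)] for i in range(n)])
noise_var = 0.5                                    # measurement noise variance

def observe(mu, Sigma, state, y):
    """Update every belief after observing a noisy value y at one state."""
    e = np.zeros(len(mu))
    e[state] = 1.0
    denom = noise_var + Sigma[state, state]
    gain = Sigma @ e / denom
    mu_new = mu + gain * (y - mu[state])
    Sigma_new = Sigma - np.outer(gain, Sigma @ e)
    return mu_new, Sigma_new

mu, Sigma = observe(mu, Sigma, state=2, y=4.0)     # one observation moves every estimate
```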

Powell, W.B., “Merging AI and OR to Solve High-Dimensional Resource Allocation Problems using Approximate Dynamic Programming,” Informs Journal on Computing, Vol. 22, No. 1, pp. 2-17 (2010). (c) Informs

This paper addresses four problem classes, defined by two attributes: the number of entities being managed (single or many), and the complexity of the attributes of an entity (simple or complex). All the problems are stochastic, dynamic optimization problems. Single, simple-entity problems can be solved using classical methods from discrete state, discrete action dynamic programs. The AI community often works on problems with a single, complex entity (e.g. a backgammon board). The OR community tends to work on problems with many simple entities. This paper briefly describes how advances in approximate dynamic programming developed within each of these communities can be brought together to solve problems with multiple, complex entities.

(click here to download paper)

Nascimento, J. and W. B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for the Lagged Asset Acquisition Problem,” Mathematics of Operations Research, Vol. 34, No. 1, pp. 210-237 (2009).

I have worked for a number of years using piecewise linear function approximations for a broad range of complex resource allocation problems. A few years ago we proved convergence of this algorithmic strategy for two-stage problems (click here for a copy). In this latest paper, we have our first convergence proof for a multistage problem. This paper is more than a convergence proof for this particular problem class – it lays out a proof technique, which combines our work on concave approximations with theory laid out by Bertsekas and Tsitsiklis (in their Neuro-Dynamic Programming book).

(click here to download paper)

Ma, J. and W. B. Powell, “A convergent recursive least squares policy iteration algorithm for multi-dimensional Markov decision process with continuous state and action spaces,” IEEE Conference on Approximate Dynamic Programming and Reinforcement Learning (part of IEEE Symposium on Computational Intelligence), March, 2009.

This conference proceedings paper provides a sketch of a proof of convergence for an ADP algorithm designed for problems with continuous and vector-valued states and actions. In addition, it assumes that the expectation in Bellman’s equation cannot be computed. The proof assumes that the value function can be expressed as a finite combination of known basis functions, and applies to a form of approximate policy iteration.

(click here to download paper)
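
For readers who want the flavor of the fitting step, here is a generic recursive least squares update for a value function written as a linear combination of basis functions, the kind of regression step used inside approximate policy iteration; the basis functions and data below are placeholders, not from the paper.

```python
import numpy as np

# Recursive least squares for a value function V(s) ~ theta . phi(s), updated one
# observation at a time. The basis functions and the observations are placeholders.

def phi(s):
    return np.array([1.0, s, s * s])     # illustrative basis functions

theta = np.zeros(3)                      # coefficient estimates
B = np.eye(3) * 100.0                    # "inverse precision" matrix (diffuse prior)

def rls_update(theta, B, s, v_hat):
    """Fold one (state, sampled value) pair into the least squares fit."""
    x = phi(s)
    gamma = 1.0 + x @ B @ x
    theta = theta + (B @ x) * (v_hat - x @ theta) / gamma
    B = B - np.outer(B @ x, x @ B) / gamma
    return theta, B

for s, v in [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]:
    theta, B = rls_update(theta, B, s, v)
```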

George, A., W.B. Powell and S. Kulkarni, “Value Function Approximation Using Hierarchical Aggregation for Multiattribute Resource Management,” Journal of Machine Learning Research, Vol. 9, pp. 2079-2111 (2008).

There are a number of problems in approximate dynamic programming where we have to use coarse approximations in the early iterations, but we would like to transition to finer approximations as we collect more information. This paper studies the statistics of aggregation, and proposes a weighting scheme that weights approximations at different levels of aggregation based on the inverse of the variance of the estimate and an estimate of the bias. This weighting scheme is known to be optimal if we are weighting independent statistics, but this is not the case here. What is surprising is that the weighting scheme works so well. We demonstrate this, and provide some important theoretical evidence why it works.

(click here to download paper)
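
The weighting scheme itself is compact: the estimate at each level of aggregation is weighted in inverse proportion to its variance plus its squared bias, and the weights are normalized. A small sketch with made-up statistics:

```python
import numpy as np

# Combine estimates of the same value made at several levels of aggregation.
# Each level contributes an estimate, an estimate of its variance, and an
# estimate of its bias relative to the most disaggregate level.
# Weights are proportional to 1 / (variance + bias^2), normalized to sum to one.

estimates = np.array([10.2, 9.5, 8.0])   # disaggregate ... aggregate (illustrative)
variances = np.array([4.0, 1.0, 0.3])    # fewer observations => higher variance
biases    = np.array([0.0, 0.8, 2.5])    # more aggregation => more bias

weights = 1.0 / (variances + biases ** 2)
weights /= weights.sum()
blended = weights @ estimates
print(weights, blended)
```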

George, A. and W.B. Powell, “Adaptive Stepsizes for Recursive Estimation with Applications in Approximate Dynamic Programming,” Machine Learning, Vol. 65, No. 1, pp. 167-198, (2006). (c) Springer

One of the first challenges anyone will face when using approximate dynamic programming is the choice of stepsizes. Use the wrong stepsize formula, and a perfectly good algorithm will appear not to work. Deterministic stepsize formulas can be frustrating since they have parameters that have to be tuned (difficult if you are estimating thousands of values at the same time). This paper reviews a number of popular stepsize formulas, provides a classic result for optimal stepsizes with stationary data, and derives a new optimal stepsize formula for nonstationary data. This result assumes we know the noise and bias (knowing the bias is equivalent to knowing the answer). A formula is provided when these quantities are unknown. Our result is compared to other deterministic formulas as well as stochastic stepsize rules which are proven to be convergent. The numerical work suggests that the new optimal stepsize formula (OSA) is very robust. It often is the best, and never works poorly.

(Click here to download paper)
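
The following sketch gives the flavor of a bias-adjusted stepsize of this kind; it is a simplified rendering from memory (so treat the exact recursion as an approximation of the published formula) in which the noise variance and bias are passed in as known, whereas the paper also shows how to estimate both adaptively.

```python
# Bias-adjusted stepsize, simplified: the stepsize shrinks as the observation noise
# begins to dominate, but stays large while the current estimate is still biased.
# Here the noise variance and the bias sequence are given; in practice both are
# estimated adaptively as the algorithm runs.

def bias_adjusted_stepsizes(noise_var, biases):
    lam = None
    for n, bias in enumerate(biases, start=1):
        if n == 1:
            alpha, lam = 1.0, 1.0        # the first observation is taken at face value
        else:
            alpha = 1.0 - noise_var / ((1.0 + lam) * noise_var + bias ** 2)
            lam = (1.0 - alpha) ** 2 * lam + alpha ** 2
        yield alpha

# With zero bias this recovers the classical 1/n rule for stationary data; with a
# decaying bias the stepsize decays more slowly.
for a in bias_adjusted_stepsizes(noise_var=1.0, biases=[5.0, 3.0, 1.5, 0.5, 0.1]):
    print(round(a, 3))
```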

Powell, W.B., A. George, B. Bouzaiene-Ayari and H. Simao, “Approximate Dynamic Programming for High Dimensional Resource Allocation Problems,” Proceedings of the IJCNN, Montreal, August 2005.

This is an easy introduction to the use of approximate dynamic programming for resource allocation problems. It shows how math programming and machine learning can be combined to solve dynamic programs with many thousands of dimensions, using techniques that are easily implemented on a laptop.

(Click here to download paper)

Powell, W.B., “The Optimizing-Simulator: Merging Simulation and Optimization using Approximate Dynamic Programming,” Proceedings of the Winter Simulation Conference, December, 2005.

Approximate dynamic programming involves iteratively simulating a system. As a result, it often has the appearance of an “optimizing simulator.” This short article, presented at the Winter Simulation Conference, is an easy introduction to this simple idea.

(Click here to download paper)

 

Introductions to ADP

W. B. Powell, H. Simao, B. Bouzaiene-Ayari, “Approximate Dynamic Programming in Transportation and Logistics: A Unified Framework,” EURO Journal on Transportation and Logistics, Vol. 1, No. 3, pp. 237-284 (2012). DOI 10.1007/s13676-012-0015-8.

Using the contextual domain of transportation and logistics, this paper describes the fundamentals of how to model sequential decision processes (dynamic programs), and outlines four classes of policies. A section describes the linkage between stochastic search and dynamic programming, and then provides a step-by-step linkage from the classical statement of Bellman’s equation to stochastic programming. The remainder of the paper uses a variety of applications from transportation and logistics to illustrate the four classes of policies.

(Click here to download paper)
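
The linkage can be stated compactly. The displays below are a generic rendering in our own notation, not an excerpt from the paper: Bellman's equation for the sequential problem, followed by a stochastic lookahead policy that replaces the expectation and the value function with an explicit optimization over a sampled scenario set, implementing only the first-period decision.

```latex
% Bellman's equation for the sequential decision problem:
V_t(S_t) = \max_{x_t \in \mathcal{X}_t}
  \Big( C(S_t, x_t) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t, x_t \big] \Big)

% A stochastic lookahead policy replaces the value function with an explicit
% optimization over sampled scenarios \omega \in \hat{\Omega}_t over a horizon
% t, \dots, t+H, implementing only the first-period decision x_t:
X_t^{\pi}(S_t) = \arg\max_{x_t,\, \{x_{t'}(\omega)\}}
  \Big( C(S_t, x_t) + \sum_{\omega \in \hat{\Omega}_t} p(\omega)
        \sum_{t'=t+1}^{t+H} C\big(S_{t'}(\omega), x_{t'}(\omega)\big) \Big)
```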

Powell, W. B., “Approximate Dynamic Programming I: Modeling,” Encyclopedia of Operations Research and Management Science, John Wiley and Sons, (to appear).

Powell, W. B., “Approximate Dynamic Programming II: Algorithms,” Encyclopedia of Operations Research and Management Science, John Wiley and Sons, (to appear).

These two short chapters provide yet another brief introduction to the modeling and algorithmic framework of ADP. The first chapter actually has nothing to do with ADP (it grew out of the second chapter). Instead, it describes the five fundamental components of any stochastic, dynamic system. There is also a section that discusses "policies", a term that is often used by specific subcommunities in a narrow way; I describe nine specific examples of policies. I think this helps put ADP in the broader context of stochastic optimization.

The second chapter provides a brief introduction to algorithms for approximate dynamic programming. In the tight constraints of these chapters for Wiley’s Encyclopedia, it is not possible to do a topic like this justice in 20 pages, but if you need a quick peek into ADP, this is one sample.

(click here to download: ADP – I: Modeling)

(click here to download: ADP – II: Algorithms)

Powell, W. B. “What you should know about approximate dynamic programming,” Naval Research Logistics, Vol. 56, No. 3, pp. 239-249, 2009.

This article is a brief overview and introduction to approximate dynamic programming, with a bias toward operations research. It highlights the major dimensions of an ADP algorithm, some strategies for approximating value functions, and brief discussions of good (and bad) modeling and algorithmic strategies.

(click here to download paper)

Powell, W.B., “Merging AI and OR to Solve High-Dimensional Resource Allocation Problems using Approximate Dynamic Programming,” Informs Journal on Computing, Vol. 22, No. 1, pp. 2-17 (2010). (c) Informs

This paper addresses four problem classes, defined by two attributes: the number of entities being managed (single or many), and the complexity of the attributes of an entity (simple or complex). All the problems are stochastic, dynamic optimization problems. Single, simple-entity problems can be solved using classical methods from discrete state, discrete action dynamic programs. The AI community often works on problems with a single, complex entity (e.g. a backgammon board). The OR community tends to work on problems with many simple entities. This paper briefly describes how advances in approximate dynamic programming developed within each of these communities can be brought together to solve problems with multiple, complex entities.

(click here to download paper)

Powell, W. B., “Approximate Dynamic Programming: Lessons from the field,” Invited tutorial, Proceedings of the 40th Conference on Winter Simulation, pp. 205-214, 2008.

This is the third in a series of tutorials given at the Winter Simulation Conference. This one has additional practical insights for people who need to implement ADP and get it working on practical applications.

(Click here to download paper)

Powell, W. B., “Approximate Dynamic Programming – A Melting Pot of Methods,” Informs Computing Society Newsletter, Fall, 2008 (Harvey Greenberg, ed.)

This article appeared in the Informs Computing Society Newsletter. It provides an easy, high-level overview of ADP, emphasizing the perspective that ADP is much more than an algorithm – it is really an umbrella for a wide range of solution procedures which retain, at their core, the need to approximate the value of being in a state.

(click here to download paper)

Approximate dynamic programming in transportation and logistics:

W. B. Powell, H. Simao, B. Bouzaiene-Ayari, “Approximate Dynamic Programming in Transportation and Logistics: A Unified Framework,” EURO Journal on Transportation and Logistics, Vol. 1, No. 3, pp. 237-284 (2012). DOI 10.1007/s13676-012-0015-8.

Using the contextual domain of transportation and logistics, this paper describes the fundamentals of how to model sequential decision processes (dynamic programs), and outlines four classes of policies. A section describes the linkage between stochastic search and dynamic programming, and then provides a step-by-step linkage from the classical statement of Bellman’s equation to stochastic programming. The remainder of the paper uses a variety of applications from transportation and logistics to illustrate the four classes of policies.

(Click here to download paper)

Simao, H. P., J. Day, A. George, T. Gifford, J. Nienow, W. B. Powell, “An Approximate Dynamic Programming Algorithm for Large-Scale Fleet Management: A Case Application,” Transportation Science, Vol. 43, No. 2, pp. 178-197 (2009). (c) Informs.

This is a major application paper, which summarizes several years of development to produce a model based on approximate dynamic programming that closely matches historical performance. The model represents drivers with 15 attributes, capturing domicile, equipment type, days from home, and all the rules (including the 70-hour-in-eight-days rule) governing drivers. The model gets drivers home on weekends on a regular basis (again, closely matching historical performance). The value functions produced by the ADP algorithm are shown to accurately estimate the marginal value of drivers by domicile.

(click here to download paper) See also the companion paper below:

Simao, H. P. A. George, Warren B. Powell, T. Gifford, J. Nienow, J. Day, “Approximate Dynamic Programming Captures Fleet Operations for Schneider National,” Interfaces, Vol. 40, No. 5, pp. 342-352, 2010. (c) Informs

This paper is a lite version of the paper above, submitted for the Wagner competition. This paper does with pictures what the paper above does with equations.

(click here to download paper)

Powell, W. B., Belgacem Bouzaiene-Ayari, Jean Berger, Abdeslem Boukhtouta, Abraham P. George, “The Effect of Robust Decisions on the Cost of Uncertainty in Military Airlift Operations,” ACM Transactions on Modeling and Computer Simulation, Vol. 22, No. 1, pp. 39-57 (2011), DOI: 10.1145/2043635.2043636.

This paper shows that approximate dynamic programming can produce robust strategies in military airlift operations. Simulations are run using randomness in demands and aircraft availability. When demands are uncertain, we vary the degree to which the demands become known in advance. The results show that if we allocate aircraft using approximate dynamic programming, the effect of uncertainty is significantly reduced. These results call into question simulations that examine the effect of advance information which do not use robust decision-making, a property that we feel reflects natural human behavior.

(Click here to download paper)

Simao, H. P. and W. B. Powell, “Approximate Dynamic Programming for Management of High Value Spare Parts”, Journal of Manufacturing Technology Management Vol. 20, No. 9 (2009).

This is a short conference proceedings paper that briefly summarizes the use of approximate dynamic programming for a real application to the management of spare parts for a major aircraft manufacturer.

(click here to download paper)

Topaloglu, H. and W.B. Powell, “Dynamic Programming Approximations for Stochastic, Time-Staged Integer Multicommodity Flow Problems,” Informs Journal on Computing, Vol. 18, No. 1, pp. 31-42 (2006). (c) Informs.

This paper applies the technique of separable, piecewise linear approximations to multicommodity flow problems. This technique worked very well for single commodity problems, but it was not at all obvious that it would work well for multicommodity problems, since there are more substitution opportunities. The experimental comparisons against multistage nested Benders decomposition (which is very slow) and more classical rolling horizon procedures suggest that it works very well indeed. This paper also provides a more rigorous treatment of what is known as the “multiperiod travel time” problem, and provides a formal development of a procedure for accelerating convergence.

(Click here to download paper)

Powell, W.B. and T. Carvalho, “Dynamic Control of Logistics Queueing Networks for Large Scale Fleet Management,” Transportation Science, Vol. 32, No. 2, pp. 90-109, 1998. (c) Informs.

This paper introduces the use of linear approximations of value functions that are learned adaptively.

Powell, W.B., J. Shapiro and H. P. Simao, “An Adaptive, Dynamic Programming Algorithm for the Heterogeneous Resource Allocation Problem,” Transportation Science, Vol. 36, No. 2, pp. 231-249 (2002). (c) Informs.

This paper also used linear approximations, but in the context of the heterogeneous resource allocation problem. In this setting, we assume that the size of the attribute state space of a resource is too large to enumerate. As a result, estimating the value of a resource with a particular set of attributes becomes computationally difficult, so we resort to hierarchical aggregation schemes.

Godfrey, G. and W.B. Powell, “An Adaptive Dynamic Programming Algorithm for Dynamic Fleet Management, I: Single Period Travel Times,” Transportation Science, Vol. 36, No. 1, pp. 21-39 (2002). (c) Informs

Godfrey, G. and W.B. Powell, “An Adaptive Dynamic Programming Algorithm for Dynamic Fleet Management, II: Multiperiod Travel Times,” Transportation Science, Vol. 36, No. 1, pp. 40-54 (2002). (c) Informs

This paper adapts the CAVE algorithm to stochastic multistage problems. The paper demonstrates both rapid convergence of the algorithm as well as very high quality solutions. We found that the use of nonlinear approximations was complicated by the presence of multiperiod travel times (a problem that does not arise when we use linear approximations).

(Click here to download paper I)

(Click here to download paper II)

Powell, W.B., A. Ruszczynski and H. Topaloglu, “Learning Algorithms for Separable Approximations of Stochastic Optimization Problems,” Mathematics of Operations Research, Vol 29, No. 4, pp. 814-836 (2004). (c) Informs.

We have been doing a lot of work on the adaptive estimation of concave functions. This paper represents a major plateau. It describes a new algorithm dubbed the Separable Projective Approximation Routine (SPAR) and includes 1) a proof that the algorithm converges when we sample all intervals infinitely often, 2) a proof that the algorithm produces an optimal solution when we only sample the optimal solution of our approximation at each iteration, when applied to separable problems, 3) a bound when the algorithm is applied to nonseparable problems such as two-stage stochastic programs with network recourse, and 4) computational comparisons against deterministic approximations and variations of Benders decomposition (which is provably optimal). The experiments show that the SPAR algorithm, even when applied to nonseparable approximations, converges much more quickly than Benders decomposition.

(Click here to download paper)
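
To convey the flavor of the update (this is a stylized sketch, not the SPAR algorithm as published): maintain the slopes of a piecewise linear, concave approximation, smooth a sampled marginal value into one segment, and then project the slopes back onto the set of nonincreasing sequences. The discretization, stepsize, and projection below are illustrative assumptions.

```python
import numpy as np

# A piecewise linear, concave value function represented by its slopes
# v[0], v[1], ... (the marginal value of the 1st, 2nd, ... unit of resource).
# Concavity means the slopes must be nonincreasing. After smoothing a sampled
# marginal value into one segment, a projection restores monotonicity.

def project_nonincreasing(v):
    """Pool-adjacent-violators projection onto nonincreasing slope vectors."""
    blocks = [[x, 1] for x in v]                        # [block mean, block size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] < blocks[i + 1][0]:             # violates concavity: merge
            total = blocks[i][0] * blocks[i][1] + blocks[i + 1][0] * blocks[i + 1][1]
            size = blocks[i][1] + blocks[i + 1][1]
            blocks[i] = [total / size, size]
            del blocks[i + 1]
            i = max(i - 1, 0)                           # re-check the previous block
        else:
            i += 1
    return np.array([m for m, size in blocks for _ in range(size)])

def concave_update(slopes, r, sampled_slope, stepsize=0.2):
    """Smooth a sampled marginal value into segment r, then re-impose concavity."""
    v = slopes.copy()
    v[r] = (1 - stepsize) * v[r] + stepsize * sampled_slope
    return project_nonincreasing(v)

slopes = np.full(10, 5.0)                               # illustrative initial slopes
slopes = concave_update(slopes, r=3, sampled_slope=9.0)
```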

Approximate dynamic programming in discrete routing and scheduling:

Spivey, M. and W.B. Powell, “The Dynamic Assignment Problem,” Transportation Science, Vol. 38, No. 4, pp. 399-419 (2004). (c) Informs.

This paper proposes a general model for the dynamic assignment problem, which involves the assignment of resources to tasks over time, in the presence of potentially several streams of information processes. It proposes an adaptive learning model that produces non-myopic behavior, and suggests a way of using hierarchical aggregation to reduce statistical errors in the adaptive estimation of the value of resources in the future. Finally, it reports on a study on the value of advance information. Past studies of this topic have used myopic models where advance information provides a major benefit over no information at all. Our model uses adaptive learning to bring forecast information into decisions made now, providing a more realistic estimate of the value of future information.

(Click here to download paper)

Approximate dynamic programming for batch service problems:

Papadaki, K. and W.B. Powell, “An Adaptive Dynamic Programming Algorithm for a Stochastic Multiproduct Batch Dispatch Problem,” Naval Research Logistics, Vol. 50, No. 7, pp. 742-769, 2003.

One of the oldest problems in dynamic programming arises in the context of planning inventories. You can use textbook backward dynamic programming if there is only one product type, but real problems have multiple products. In this paper, we consider a multiproduct problem in the context of a batch service problem where different types of customers wait to be served. Arrivals are stochastic and nonstationary. We show that an approximate dynamic programming strategy using linear value functions works quite well and is computationally no harder than a simple myopic heuristic (once the iterative learning is completed).

(click here to download paper)
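
A hedged sketch of the kind of dispatch rule a linear value function produces: compare the cost of dispatching now against holding, where the future is summarized by learned linear coefficients on the number of customers of each type left behind. The costs, capacity, coefficients, and single-vehicle setting are illustrative assumptions, not the paper's model.

```python
# Batch dispatch with a linear value function approximation: the policy compares
# dispatching now against holding, where the future is summarized by learned
# linear coefficients v[k] on the number of waiting customers of each type.

DISPATCH_COST = 50.0
HOLD_COST = [1.0, 2.0]          # per-period holding cost per customer of each type
CAPACITY = 20                   # vehicle capacity

def dispatch_now(queue, v):
    """Return True if dispatching scores better than holding, given linear value
    coefficients v (approximate marginal cost of carrying one more customer)."""
    # If we dispatch, load the higher-holding-cost customers first, up to capacity.
    order = sorted(range(len(queue)), key=lambda k: -HOLD_COST[k])
    left, room = list(queue), CAPACITY
    for k in order:
        loaded = min(left[k], room)
        left[k] -= loaded
        room -= loaded
    cost_dispatch = DISPATCH_COST + sum((HOLD_COST[k] + v[k]) * left[k]
                                        for k in range(len(queue)))
    cost_hold = sum((HOLD_COST[k] + v[k]) * queue[k] for k in range(len(queue)))
    return cost_dispatch < cost_hold

print(dispatch_now(queue=[15, 10], v=[0.8, 1.5]))
```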

Papadaki, K. and W.B. Powell, “Exploiting structure in adaptive dynamic programming algorithms for a stochastic batch service problem,” European Journal of Operational Research, Vol. 142, No. 1, pp. 108-127 (2002). (c) Elsevier.

This paper compares an optimal policy for dispatching a truck over a single link (with one product type) against an approximate policy that uses approximations of the future. Why would we approximate a problem that is easy to solve to optimality? Because the optimal policy only works on single-link problems with one type of product, while the approximate policy scales to much harder problems.

(click here to download paper)