From the jungle of stochastic optimization to… Sequential Decision Analytics

I send out frequent updates regarding my work on LinkedIn – if you are not a follower, please reach out and ask to join.

Some highlights:

Sequential decision problems are problems that consist of decision, information, decision, information, …, while incurring costs or receiving rewards. Sequential decision problems cover an incredibly broad problem class, spanning engineering, the sciences, business, economics, finance, health, transportation, energy and e-commerce. The problems may be discrete dynamic programs, continuous control problems, graph problems, stochastic search, active learning, and multiagent games and applications.

We have a very good handle on modeling deterministic optimization problems, but the universe of problems that involve sequential decisions and information has resisted being expressed in the kind of common canonical framework that has become universal in deterministic optimization.

Sequential decision problems are so diverse that the field has become fragmented into a Balkanized set of communities with competing algorithmic strategies and modeling styles. It is easy to spend a significant part of a career mastering the subtleties of each perspective on stochastic optimization, without recognizing the common themes.

We have developed a unified framework that covers all of these communities. We break all sequential decision problems into five components: state variables, decision variables, exogenous information variables, the transition function, and the objective function. We search over policies, which are functions (or methods) for making decisions. We have identified two core strategies for designing policies (policy search, and policies based on lookahead approximations), each of which further divides into two subclasses, creating four classes that cover all of the solution approaches in the books illustrated above, as well as anything being used in practice.
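As a concrete illustration of these five components, here is a minimal Python sketch of a generic sequential decision model, and of simulating a policy against it. The class and function names (SequentialDecisionModel, order_up_to, etc.) and the toy inventory dynamics are illustrative assumptions, not code from any of our books or software.

    import random

    class SequentialDecisionModel:
        """A toy inventory problem, organized around the five components."""

        def __init__(self, horizon, initial_state):
            self.horizon = horizon                  # number of decision epochs
            self.initial_state = initial_state      # initial state variable S_0

        def exogenous_info(self, state, t):
            """Exogenous information W_{t+1}: what we learn after deciding."""
            return {"demand": random.randint(0, 5)}

        def transition(self, state, decision, info):
            """Transition function: S_{t+1} = S^M(S_t, x_t, W_{t+1})."""
            on_hand = state["inventory"] + decision["order"]
            return {"inventory": max(0, on_hand - info["demand"])}

        def contribution(self, state, decision, info):
            """One-period contribution C(S_t, x_t, W_{t+1})."""
            sales = min(state["inventory"] + decision["order"], info["demand"])
            return 10.0 * sales - 2.0 * decision["order"]

        def simulate(self, policy, seed=None):
            """Objective function: accumulate contributions while following a policy."""
            random.seed(seed)
            state, total = dict(self.initial_state), 0.0
            for t in range(self.horizon):
                decision = policy(state, t)         # a policy maps a state to a decision
                info = self.exogenous_info(state, t)
                total += self.contribution(state, decision, info)
                state = self.transition(state, decision, info)
            return total

    # A policy is just a function of the state (here, a simple order-up-to rule).
    def order_up_to(state, t, target=5):
        return {"order": max(0, target - state["inventory"])}

    model = SequentialDecisionModel(horizon=20, initial_state={"inventory": 0})
    print(model.simulate(order_up_to, seed=1))

Designing a better rule than order_up_to, and evaluating it by simulation, is exactly the search over policies described below.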

The modeling and algorithmic challenges involve three core elements:

  • Modeling the problem (state variables, decision variables, exogenous information processes, transition function, objective function).
  • Modeling the uncertainty (uncertainty quantification).
  • Designing and evaluating policies.

There is a growing community, which started in computer science, that uses the term “reinforcement learning” to refer to sequential decision problems. See my explanation of “artificial intelligence,” “machine learning,” and “reinforcement learning.”

Our approach does not replace any prior work: rather, it brings all of this work together, and helps to identify opportunities for cross-fertilization.  A good way to think of the unified framework is “standing on the shoulders of giants.”

Educational materials:

All of the material below, including the material for the graduate course, is written for students with an undergraduate course in probability and statistics.  From time to time a second course in probability may be useful, and I occasionally present problems that use linear programming, but in a way that can be read without a full course in math programming (there may be occasional exceptions).  

Talks and videos:

Some key points

  • State variables – We claim that all properly modeled dynamic systems are Markovian, and we provide a teachable, implementable definition that students can use to guide their efforts to model these systems (see chapter 9 of Reinforcement Learning and Stochastic Optimization, available at the link above). If you have a counterexample, email it to me. Examine that chapter for a fresh perspective on state variables.
  • A dynamic program is a sequential decision problem (it is not a method). Bellman’s equations are (a) conditions for an optimal policy and (b) a path to designing good policies (but just one of four paths); a sketch of the equation is given after this list.
  • We introduce the distinction between the base model, which is our model of the problem we are trying to solve, and a lookahead model, widely used as one class of policy (especially in stochastic programming). The base model is often referred to as a “simulator” without recognizing that this is the problem we are trying to solve. The base model is comparable to the objective function in a deterministic optimization model.
  • When solving a deterministic optimization problem, we want to find the best decision (often a real-valued vector). When solving a stochastic optimization problem, we typically want to find the best function (known as a policy). 
  • A multistage stochastic program (using scenario trees) is a lookahead model for designing a policy (called a lookahead policy) for solving dynamic programs. An optimal solution of a lookahead model is generally not an optimal policy, even though the lookahead model itself may be quite hard to solve.
  • “Robust optimization” (for sequential problems) is a lookahead policy (using a min-max objective) for solving a dynamic program whose objective may be formulated using an expectation (which is what is implied if you simulate the policy and average the results), the worst case (which is consistent with the robust concept), or any other risk measure.
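As a sketch of the point about Bellman’s equations above, the optimality condition can be written in the notation used here as follows, where V_t is the value function, C the contribution, and S^M the transition function:

    V_t(S_t) = \max_{x_t} \Big( C(S_t, x_t) + \mathbb{E}\big[ V_{t+1}(S_{t+1}) \mid S_t, x_t \big] \Big),
    \quad \text{where} \quad S_{t+1} = S^M(S_t, x_t, W_{t+1}).

An optimal policy satisfies this recursion, but approximating the value function (a VFA) is only one of the four paths to designing a good policy.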

Our canonical model

The slides below provide a thumbnail sketch of the central idea behind our modeling and algorithmic strategy.

The first slide below contrasts the standard canonical form for a multiperiod, deterministic linear program (on the left) with the approach we are advocating as the canonical form for a sequential, stochastic optimization problem:

[Slide: deterministic canonical form vs. stochastic canonical form]

The key idea is that when working on sequential stochastic problems, you are searching for the best function (known as a policy), rather than the more familiar problem of finding the best real-valued scalar or vector.
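Since the slide is not reproduced here, a rough sketch of the contrast, using standard notation (the exact form on the slide may differ), is the following. The deterministic multiperiod linear program searches for a vector of decisions:

    \min_{x_0, \ldots, x_T} \sum_{t=0}^{T} c_t x_t
    \quad \text{subject to} \quad A_t x_t - B_{t-1} x_{t-1} = b_t, \quad x_t \ge 0,

while the sequential, stochastic problem searches over policies (functions that map states to decisions):

    \max_{\pi} \mathbb{E} \Big\{ \sum_{t=0}^{T} C\big(S_t, X^\pi(S_t)\big) \,\Big|\, S_0 \Big\},
    \qquad S_{t+1} = S^M\big(S_t, X^\pi(S_t), W_{t+1}\big).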

This raises the obvious question – how do you search over a space of functions???

We have taken a practical path to answering this question. We have identified four fundamental (meta) classes of policies: PFAs (policy function approximations), CFAs (cost function approximations), VFAs (policies based on value function approximations), and DLAs (direct lookahead approximations). These are summarized on the following slide:

We let π be the variable that captures both the type of function and any parameters that determine the behavior of the function. A tunable parameter could be a regression parameter, or variables that determine the planning horizon, the number of stages, or the number of scenarios.

In the equations above, we use tildes on variables that describe the lookahead model to avoid confusing them with the base model. This is only an issue when we are using lookahead policies.
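For example, a stochastic direct lookahead policy can be written schematically as follows (a hedged sketch of the generic form; the tilde variables such as \tilde{S}_{tt'} belong to the lookahead model created at time t and projected to time t', and should not be confused with the base-model variables):

    X_t^{DLA}(S_t) = \arg\max_{x_t} \Big( C(S_t, x_t)
      + \tilde{\mathbb{E}} \Big\{ \max_{\tilde{\pi}} \, \tilde{\mathbb{E}} \Big[ \sum_{t'=t+1}^{t+H} C\big(\tilde{S}_{tt'}, \tilde{X}^{\tilde{\pi}}(\tilde{S}_{tt'})\big) \,\Big|\, \tilde{S}_{t,t+1} \Big] \,\Big|\, S_t, x_t \Big\} \Big).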

Note that these four classes of policies span all the standard modeling and algorithmic paradigms, including dynamic programming (including approximate/adaptive dynamic programming and reinforcement learning), stochastic programming, and optimal control (including model predictive control).

[Figure: energy storage problem]

It is important to recognize that, even for the same problem, any of the four classes of policies may work best, depending on the data. We demonstrate this in the context of a simple energy storage problem. We created five variations of the problem, which we solved using each of the four classes of policies, plus a hybrid. In the graph below, we report the performance of each policy as a fraction of the posterior optimal (the best you could do with perfect information). Each policy works best on one of the five problems.

[Figure: performance of the four classes of policies on the five storage problems]

W.B. Powell, S. Meisel, Tutorial on Stochastic Optimization in Energy II: An energy storage illustration, IEEE Trans. on Power Systems, Vol. 31, No. 2, pp. 1468-1475, 2016
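To illustrate what a tunable policy for a storage problem looks like, here is a minimal Python sketch of a buy-low/sell-high PFA tuned by policy search. The parameter names (theta_buy, theta_sell), the price process, and all numbers are illustrative assumptions, not the policies or data used in the paper above.

    import random

    def simulate_pfa(theta_buy, theta_sell, horizon=1000, capacity=10.0, seed=0):
        """Simulate a buy-low/sell-high PFA for a toy energy storage problem.

        theta_buy and theta_sell are the tunable parameters of the policy:
        charge when the price falls below theta_buy, discharge above theta_sell.
        """
        rng = random.Random(seed)
        storage, profit, price = 0.0, 0.0, 30.0
        for t in range(horizon):
            # Exogenous information: a mean-reverting price (illustrative only).
            price = max(0.0, price + 0.1 * (30.0 - price) + rng.gauss(0.0, 3.0))
            # The PFA: a simple rule parameterized by (theta_buy, theta_sell).
            if price <= theta_buy and storage < capacity:
                x = min(1.0, capacity - storage)   # charge one unit if there is room
                profit -= price * x
                storage += x
            elif price >= theta_sell and storage > 0.0:
                x = min(1.0, storage)              # discharge one unit if available
                profit += price * x
                storage -= x
        return profit

    # Policy search: tune (theta_buy, theta_sell) by simulating and averaging.
    best = max(
        ((tb, ts) for tb in range(20, 31) for ts in range(31, 41)),
        key=lambda th: sum(simulate_pfa(th[0], th[1], seed=s) for s in range(20)) / 20,
    )
    print("best (theta_buy, theta_sell):", best)

The same logic applies to any policy class: fix the structure of the function, then tune its parameters by simulating the base model and averaging.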

We prefer the approach universally used in deterministic optimization, where we formulate the problem first and then design methods to find a solution (in the form of a policy). But this requires accepting that in sequential problems we are not looking for decisions (as we do in deterministic models), but rather for functions (or policies). Classical strategies in stochastic optimization (described using familiar labels such as dynamic programming, stochastic programming, robust optimization and optimal control) actually represent particular classes of policies. So, choosing to solve a problem with one of these approaches means choosing the class of policy before you have even started modeling the problem.

Warren Powell
wbpowell328@gmail.com