The computational and applied research in CASTLE Labs is supported by a strong foundation of theoretical research. Most of these are contributions to stochastic optimization, but recently we have been making contributions to machine learning and a field of research we are referring to as optimal learning. This work has led to practical breakthroughs such as a fleet simulator for a major transportation company which produced an infinite dimensional stochastic dynamic program, and a stochastic, multiple scale model for energy resource planning which spanned hundreds of thousands of time periods.

*This paper derives the first finite-time bounds for a knowledge gradient policy, using the principle of the ***prior view*** of the objective function (in contrast with the more classical posterior view). The prior view provides a cleaner relationship between the performance of the policy and the sample taken, making it possible to relate the value of information to the submodularity of the sample. The paper then introduces a Modular Optimal Learning Testing Environment (MOLTE) which provides a highly flexible environment for testing a range of learning policies on a library of test problems. Click here to access MOLTE.*

**Yan Li, Han Liu, W.B. Powell, “The Knowledge Gradient Policy using a Sparse Additive Belief Model,” **

*This paper derives a knowledge gradient policy using a sparse additive belief model, combining a frequentist recursive Lasso algorithm with a Bayesian belief model. The paper provides finite time bounds on the error in the estimates of the sparse additive model.*

**Daniel Jiang, W. B. Powell, “An Approximate Dynamic Programming Algorithm for Monotone Value Functions,” **(updated May 25, 2015)

*This paper proves convergence of an ADP algorithm using a lookup table representation of the value function which maintains monotonicity in each of the dimensions. The paper includes both an asymptotic convergence proof as well as rate of convergence results. The paper demonstrates that the algorithm outperforms popular competing algorithms using three benchmark problems for which optimal policies can be found.*

**Boris Defourny, Ilya O. Ryzhov, W. B. Powell, “Optimal Information Blending with Measurements in the L2 Sphere,” Mathematics of Operations Research, Vol. 40, No. 4, Nov 2015, pp. 1060-1088. http://dx.doi.org/10.1287/moor.2015.0712. **

*We consider a sequential information collection problem where a risk-averse decision-maker updates a Bayesian belief about the unknown objective function of a linear program. The information is collected in the form of a linear combination of the objective coefficients, subject to random noise. We have the ability to choose the weights in the linear combination, creating a new, nonconvex continuous optimization problem, which we refer to as information blending. We develop two optimal blending strategies: an active learning method that maximizes uncertainty reduction, and an economic approach that maximizes an expected improvement criterion. Semidefinite programming relaxations are used to create efficient convex approximations to the nonconvex blending problem.*

*This paper studies stepsizes in the context of bootstrapping-based algorithms (value iteration, Q-learning), which introduces the problem of designing a stepsize that balances the need to add costs/rewards over time (which requires larger stepsizes), while smoothing noise (which requires smaller stepsizes). We provide tight upper and lower bounds that proves that 1/n converges to the optimal solution, but converges so slowly it should never be used. We then use a single state, single-action problem to derive an optimal stepsize for both steady state and finite horizon problems.*

*This paper proves convergences of an optimal learning algorithm (using the knowledge gradient) while using a local parametric belief model. *

*Stochastic search involves finding a set of controllable parameters that minimizes an unknown objective function using a set of noisy observations. We consider the case when the unknown function is convex and a metamodel is used as a surrogate objective function. Often the data are non-i.i.d. and include a observable state variable, such as applicant information in a loan rate decision problem. State information is difficult to incorporate into convex models. We propose a new semi-convex regression method that is used to produce a convex metamodel in the presence of a state variable. We show consistency for this method. We demonstrate its effectiveness for metamodeling on a set of synthetic inventory management problems and a large, real-life auto loan dataset.*

**R. Collado, W. B. Powell, “Threshold Risk Measures for Stochastic Optimization,”** (under review)

*In this paper we introduce the threshold risk measures, a class of risk measures incompatible with the coherent risk measures. In particular, the threshold risk measures consider the risk involved in applications where being above a threshold (for minimization problems) is considered too risky and should be avoided as much as possible. In this paper we develop the threshold risk measures together with their dynamic counterparts and apply these to the nite horizon dynamic programming with risk measures model. We develop several Bellman-type recursive algorithms to solve the nite horizon problem dynamic problem.*

**J. Nascimento, W. B. Powell, “An Optimal Approximate Dynamic Programming Algorithm for Concave, Scalar Storage Problems with Vector-Valued Controls,” IEEE Transactions on Automatic Control, Vol. 58, No. 12, pp. 2995-3010. http://dx.doi.org/10.1109/TAC.2013.2272973 (2013).**

*This paper proves convergence for an ADP algorithm using approximate value iteration (TD(0))*, *for problems that feature vector-valued decisions (e.g. allocating energy over a grid), linked by a scalar storage system, such as a water reservoir. The problem arises in settings where resources are distributed from a central storage facility. The algorithm is well suited to continuous problems which requires that the function that captures the value of future inventory be finely discretized, since the algorithm adaptively generates break points for a piecewise linear approximation. The strategy does not require exploration, which is common in reinforcement learning.*

*The knowledge-gradient policy was originally derived for off-line learning problems such as ranking and selection. Considerable attention has been given to the on-line version of this problem, known popularly as the multiarmed bandit problem, for which Gittins indices are known to be optimal for discounted, infinite-horizon versions of the problem. In this paper, we derive a knowledge gradient policy for on-line problems, and show that it very closely matches the performance of Gittins indices for discounted infinite horizon problems. It actually slightly outperforms the best available approximation of Gittins indices (by Gans and Chick) on problems for which Gittins indices should be optimal. The KG policy is also effective on finite horizon problems. The only policy which is competitive with KG seems to be interval estimation, but this requires careful tuning of a parameter. The KG policy also works on problems where the beliefs about different alternatives are correlated. The KG policy with independent beliefs is extremely easy to compute (we provide closed-form expressions for the case with normal rewards), and requires a simple numerical algorithm for the case with correlated beliefs.*

**Ryzhov, I., W. B. Powell, “Information Collection for Linear Programs with Uncertain Objective Coefficients,” SIAM J. Optimization, Vol. 22(4), pp. 1344–1368 http://epubs.siam.org/doi/abs/10.1137/12086279X. (2012).**

Finding the optimal solution of a linear program assumes that you have accurate information on costs (among other things). In some settings, these costs may be approximate, and getting more information can be expensive. For example, a problem in logistics might require including costs that reflect the quality of service provided by a vendor, but it may be necessary to use the vendor for a period of time, or collect historical information from other manufacturers, to refine these costs.

This paper develops the knowledge gradient for maximizing the expected value of information when solving linear programs. The paper establishes asymptotic optimality for off-line versions of the problem and proposes a computationally tractable algorithm.

*We consider Bayesian information collection, in which a measurement policy collects information to support a future decision. This framework includes ranking and selection, continuous global optimization, and many other problems in sequential experimental design. We give a sufficient condition under which measurement policies sample each measurement type infinitely often, ensuring consistency, i.e., that a globally optimal future decision is found in the limit. This condition is useful for verifying consistency of adaptive sequential sampling policies that do not do forced random exploration, making consistency difficult to verify by other means. We demonstrate the use of this sufficient condition by showing consistency of two previously proposed ranking and selection policies: OCBA for linear loss, and the knowledge-gradient policy with independent normal priors. Consistency of the knowledge-gradient policy was shown previously, while the consistency result for OCBA is new.*

*We study the joint problem of sequential change detection and multiple hypothesis testing. Suppose that the common distribution of a sequence of i.i.d. random variables changes suddenly at some unobservable time to one of nitely many distinct alternatives, and one needs to both detect and identify the change at the earliest possible time. We propose computationally efficient sequential decision rules that are asymptotically either Bayes-optimal or optimal in a Bayesian fixed-error formulation, as the unit detection delay cost or the misdiagnosis and false alarm probabilities go to zero, respectively. Numerical examples are provided to verify the asymptotic optimality and the speed of convergence.*

*This paper addresses the problem of optimizing a stochastic function that is heavy-tailed (infinite variance). We first present an algorithm for calculating a quantile of the function (e.g. the median) without having to store a history of observations of the function. We then present an adaptation of the original Robbins-Monro algorithm which finds the parameter that optimizes a quantile of the random function using asymmetric signum functions, which avoids calculations that involve potentially very large gradients. We prove convergence of the algorithm, and then demonstrate it on a newsvendor problem (which is not heavy-tailed), where we show that optimizing higher quantiles of the newsvendor can reduce or elimiinate the risk of losing money. The algorithm is then demonstrated for a problem to optimize a policy for energy storage in the presence of electricity spot prices, which are heavy tailed.*

**L. Hannah, W. B. Powell, D. Blei, “Nonparametric Density Estimation for Stochastic Optimization with an Observable State Variable,” Neuro-Information Processing Society, December, 2010. Supplement**

*Consider the newsvendor problem, where we have to choose a quantity and then observe a demand. Now assume that we are first given information about the state of the world – perhaps it is rainy or sunny. This observable state changes our belief about the demand, which changes our optimal solution. We have to design an online, stochastic optimization algorithm that solves the problem without actually knowning the distribution of demand, even when we know the state. If there were only two states of the world, the problem would be easy. Now assume the state of the world is a multidimensional vector. We use Dirichlet process mixtures to prove convergence to optimality of an online learning algorithm which can be used for complex state-of-the-world variables. We demonstrate the method in the context of making commitments for wind energy.*

Using kernel smoothing techniques, we propose three dierent online, on-policy approximate policy iteration algorithms which can be applied to innite horizon problems with continuous and vector-valued states and actions. Using Monte Carlo sampling to estimate the value function around the post-decision state, we reduce the problem to a sequence of deterministic, nonlinear programming problems that allow us to handle continuous, vector-valued states and actions. We provide a formal convergence analysis of the algorithms under a variety of

technical assumptions.

*We have previously developed the knowledge gradient with correlated beliefs for discrete alternatives. This paper extends this idea to problems with continuous alternatives. We do this by developing a continuous approximate of the knowledge gradient. This produces a nonconcave surface that we have to maximize. Local minima are located close to points that have been previously measured, so we use these points to guess at the locations of local maxima and then use a simple gradient search algorithm starting from each of these points. A proof of convergence is provided. We use the distances between local minima to perform scaling of the steepest descent algorithm. We compare the method against Huang’s adaptation of sequential kriging to problems with noisy measurements.*

We propose Dirichlet Process-Generalized Linear Models (DP-GLM), a new method of nonparametric regression that accommodates continuous and categorical inputs, and any response that can be modeled by a generalized linear model. We prove conditions for the asymptotic unbiasedness of the DP-GLM regression mean function estimate and give a practical example for when those conditions hold. Additionally, we provide Bayesian bounds on the distance of the estimate from the true mean function based on the number of observations and posterior samples. We evaluate DP-GLM on several data sets, comparing it to modern methods of nonparametric regression like CART and Gaussian processes. We show that the DP-GLM is competitive with the other methods, while accommodating various inputs and outputs and being robust when confronted with heteroscedasticity.

We propose an online, on-policy least-squares policy iteration (LSPI) algorithm which can be applied to innite horizon problems with where states and controls are vector-valued and continuous. We do not use special structure such as linear, additive noise, and we assume that the expectation cannot be computed exactly. We use the concept of the post-decision state variable to eliminate the expectation inside the optimization problem. We provide a formal convergence analysis of the algorithm under the assumption that value functions are spanned by nitely many known basis functions. Furthermore, the convergence result extends to the more general case of unknown value function form using orthogonal polynomial approximation.

*I have worked for a number of years using piecewise linear function approximations for a broad range of complex resource allocation problems. A few years ago we proved convergence of this algorithmic strategy for two-stage problems (click here for a copy). In this latest paper, we have our first convergence proof for a multistage problem. This paper is more than a convergence proof for this particular problem class – it lays out a proof technique, which combines our work on concave approximations with theory laid out by Bertsekas and Tsitsiklis (in their Neuro-Dynamic Programming book).*

*One of the most famous problems in information collection is the multiarmed bandit problem, where make a choice (out of a discrete set of choices), observe a reward, and use this observation to update estimates of the future value of rewards. This problem can be solved by choosing the option with the highest index (known as the Gittins index). In this paper, we present an index problem for the case where not all the choices are available each time. As a result, it is sometimes important to make an observation just because the observation is available to be made.*

*The knowledge gradient policy is introduced here as a method for solving the ranking and selection problem, which is an off-line version of the multiarmed bandit problem. Imagine that you have M choices (M is not too large) where you have a normally distributed belief about the value of each choice. You have a budget of N measurements to evaluate each choice to refine your distribution of belief. After your N measurements, you have to choose what appears to be the best based on your current belief. The knowledge gradient policy guides this search by always choosing to measure the choice which would produce the highest value if you only have one more measurement (the knowledge gradient can be viewed as a method of steepest ascent). The paper shows that this policy is myopically optimal (by construction), but is also asymptotically optimal, making it the only stationary policy that is both myopically and asymptotically optimal. The paper provides bounds for finite measurement budgets, and provides experimental work that shows that it works as well as, and often better, than other standard learning policies.*

Structural properties of stochastic dynamic programs are essential to understanding the nature of the solutions and in deriving appropriate approximation techniques. We concentrate on a class of multidimensional Markov decision processes and derive sufficient conditions for the monotonicity of the value functions. We illustrate our result in the case of the multi-product batch dispatch (MBD) problem.

*One of the first challenges anyone will face when using approximate dynamic programming is the choice of stepsizes. Use the wrong stepsize formula, and a perfectly good algorithm will appear not to work. Deterministic stepsize formulas can be frustrating since they have parameters that have to be tuned (difficult if you are estimating thousands of values at the same time). This paper reviews a number of popular stepsize formulas, provides a classic result for optimal stepsizes with stationary data, and derives a new optimal stepsize formula for nonstationary data. This result assumes we know the noise and bias (knowing the bias is equivalent to knowing the answer). A formula is provided when these quantities are unknown. Our result is compared to other deterministic formulas as well as stochastic stepsize rules which are proven to be convergent. The numerical work suggests that the new optimal stepsize formula (OSA) is very robust. It often is the best, and never works poorly.*

*We have been doing a lot of work on the adaptive estimation of concave functions. This paper represents a major plateau. It describes a new algorithm dubbed the *Separable Projective Approximation Routine *(SPAR) and includes 1) a proof that the algorithm converges when we sample all intervals infinitely often, 2) a proof that the algorithm produces an optimal solution when we only sample the optimal solution of our approximation at each iteration, when applied to separable problems, 3) a bound when the algorithm is applied to nonseparable problems such as two-stage stochastic programs with network resource, and 4) computational comparisons against deterministic approximations and variations of Benders decomposition (which is provably optimal). The experiments show that the SPAR algorithm, even when applied to nonseparable approximations, converges much more quickly than Benders decomposition.*

*This paper offers a provably convergent algorithm for two-stage stochastic programs. The method uses a differentiable (strictly convex) auxiliary function, and then “tilts” it using stochastic gradient information. The algorithm is very easy to implement.*

*This paper provides a proof of convergence for a class of nested Bender’s cutting-plane algorithm. The basic model allows only right-hand side randomness, and requires independence between stages. The algorithm is computationally tractable for problems that are much larger than any prior algorithm for which a proof of convergence has been produced. It requires a finite sample space, but it may have thousands of realizations per stage, and the technique can handle dozens (even hundreds) of stages. Numerical performance will depend on the specific problem class.*