Warren B. Powell
Fundamental to sequential decision problems is the concept of a state variable. Astonishingly, entire communities that deal with sequential decision problems use the term “state variable” (or state space) without offering a proper definition (although some think they do).
Below is a summary of some competing perspectives of state variables. I end with “My definition” which I think is precise and clear enough to teach to any audience, and resolves all of my issues with state variables. An extended description can be found in section 9.4 of RLSO.
Definitions from the MDP community:
Some “definitions” of state variables:
- Bellman’s seminal text [Bellman (1957), p. 81] says “… we have a physical system characterized at any stage by a small set of parameters, the state variables.” (Italics are from the original text)
- Puterman first introduces a state variable by saying [Puterman (2005), p. 18] “At each decision epoch, the system occupies a state.” (Italics are from the original text)
- From Wikipedia: “A state variable is one of the set of variables that are used to describe the mathematical ‘state’ of a dynamical system.” (The next sentence says: “Intuitively, the state of a system describes enough about the system to determine its future behaviour in the absence of any external forces affecting the system.” But we can still define state variables in the presence of exogenous information flows, so this statement is not accurate either.)
Let me start by asking: Didn’t we all learn in grade school that we do not use the word we are defining in its definition??!!
A definition from the RL community:
The reinforcement learning literature inherited the style of not defining state variables from the literature on Markov decision processes, but a notable exception is the second edition of Sutton and Barto’s Reinforcement Learning: An Introduction. While they never explicitly define a state variable, they offer descriptions:
- On p. 7, under section 1.4 Limitations and Scope, the authors note: “… we encourage the reader to follow the informal meaning and think of the state as whatever information is available to the agent about its environment.”
- In chapter 3 they then say [p. 49] “The state must include information about all aspects of the past agent-environment interaction that make a difference for the future.”
The first bullet seems to suggest that all available information (about the environment) is in the state variable, but does not define “environment.” The second bullet includes the condition “that make(s) a difference for the future.” Keep reading.
From some theoreticians:
I have spoken to numerous mathematicians (in stochastic control/optimization) who will insist “but I know what a state variable is.” Consider the following statements made by some of the best known names in the field:
- From Probability and Stochastics by Erhan Cinlar (2011) – a former colleague at Princeton and one of the best known probabilists in the field: “The definitions of ‘time’ and ‘state’ depend on the application at hand and on the demands of mathematical tractability. Otherwise, if such practical considerations are ignored, every stochastic process can be made Markovian by enhancing its state space sufficiently.”
- From Bertsekas’ Dynamic Programming and Optimal Control: Approximate Dynamic Programming (4th edition, 2012): “… we assume that at each time k, the control is applied with knowledge of the current state x_k. Such policies are called Markov because they do not involve dependence on states beyond the current. However, what if the control were allowed to depend on the entire past history: h_k = \{x_0, u_0, \ldots, x_{k-1}, u_{k-1}, x_k\}, which ordinarily would be available at time k. Is it possible that better performance can be achieved in this way?” (WBP: If this were the case, then there is information from “history” that is needed to make decisions, so why isn’t this included in the state variable?)
- In Puterman’s wonderful book Markov Decision Processes, on p. 97 he presents a graph problem that involves finding the path through a network that minimizes the second highest cost on the path (rather than the sum of the costs). He then goes on to argue that Bellman’s optimality equation no longer works! This is because he changes how costs are calculated, but still assumes the state of the system is the node where a traveler is located. The problem is that with the revised cost metric, you also have to keep track of the two highest costs on the path the traveler has traversed, because this is what is needed to determine whether a cost on the next arc is one of the top two.
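The point about Puterman's path problem can be illustrated with a minimal sketch in Python. The graph, the function names, and the convention that a path with fewer than two arcs has a second highest cost of 0 (with nonnegative arc costs) are all assumptions made for illustration; they do not come from Puterman's book. The search runs over augmented states (node, two highest costs so far) rather than over nodes alone.

```python
from collections import deque


def update_top_two(top_two, cost):
    """Return the two largest arc costs after traversing an arc with this cost."""
    highest, second = top_two
    if cost >= highest:
        return (cost, highest)
    return (highest, max(second, cost))


def min_second_highest(arcs, origin, destination):
    """Smallest second-highest arc cost over all origin-destination paths.

    The state is (node, (highest, second_highest)), not just the node.
    Assumes nonnegative costs, so (0, 0) is a valid starting summary.
    Returns None if the destination cannot be reached.
    """
    start = (origin, (0, 0))
    seen = {start}
    queue = deque([start])
    best = None
    while queue:
        node, top_two = queue.popleft()
        if node == destination:
            second = top_two[1]
            best = second if best is None else min(best, second)
        for next_node, cost in arcs.get(node, []):
            state = (next_node, update_top_two(top_two, cost))
            if state not in seen:  # finite augmented state space, so this terminates
                seen.add(state)
                queue.append(state)
    return best


# Hypothetical network: two ways to reach B, then a single arc B -> D.
arcs = {
    "A": [("B", 8), ("X", 5)],
    "X": [("B", 5)],
    "B": [("D", 7)],
}
print(min_second_highest(arcs, "A", "D"))
```

In this example the direct arc A-B arrives at B with top-two costs (8, 0), while the route through X arrives with (5, 5). A recursion that stores only one value per node would prefer the direct arc at B (second highest so far of 0 versus 5), and would then finish with a second highest cost of 7. Carrying the two highest costs in the state finds the route through X, which finishes with 5.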
If we agree that a state is all the information you need to model the system from time t onward, then the system is, by definition (and by construction) Markovian. Further, you would never need information from history since again, by definition (and by construction), the state variable already has any information that may have arrived before time t (or “time” k). So, there is no need to “expand the state space sufficiently,” nor any need to depend on history.
[Side note: a talented post-doc in my lab posed the question: What if we simply do not know all the information we need? This raises subtle issues that are more than I can cover on a webpage. See note (vii) on page 483 of RLSO (following the definition of states) and section 20.2 in RLSO which uses a two-agent model of flu mitigation to illustrate the setting of when a controlling agent does not know the environment.]
Definitions from optimal control
Now look at some definitions in books on optimal control:
- From Kirk (2004): A state variable is a set of quantities x_1(t),x_2(t),\ldots, which if known at time t = t_0, are determined for t \geq t_0 by specifying the inputs for the system for t \geq t_0.
- From Cassandras and Lafortune (2008): The state of a system at time t_0 is the information required at time t_0 such that the output [cost] y(t) for all t \geq t_0 is uniquely determined from this information and from u(t).
These are both stated as formal definitions, and both can be restated simply as:
- The state is all the information you need at time t to model the system from time t onward.
Both of the definitions above understand that to model the system moving forward, you need the controls u(t) (presumably determined by a “control law” or “policy”) as well as any exogenous (random) information. These definitions appear to be standard in optimal control.
I like the characterization, widely used in books on optimal control, that the state variable is all the information you need to model the system from time t onward, regardless of when the information arrived! My only complaint is that it needs to be more explicit.
My definition
In my new book RLSO (section 9.4), I offer two definitions of state variables depending on whether a policy has been specified or not.
- Policy-dependent version – A function of history that, combined with the exogenous information (and a policy), is necessary and sufficient to compute the cost/contribution function, the decision function (the policy), and any information required by the transition function to model the information needed for the cost/contribution and decision functions.
- Optimization version – A function of history that is necessary and sufficient to compute the cost/contribution function, the constraints, and any information required by the transition function to model the information needed for the cost/contribution function and the constraints.
Both definitions are completely consistent with the “all the information you need …” definitions from optimal control. It is just that I have identified the specific places where we need to provide information: the cost/contribution function, the policy (or constraints), and then the equations used to model how this information evolves over time (this is inside the transition function).
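To make the "necessary and sufficient" idea concrete, here is a minimal sketch in Python of a small storage problem. Everything in it is a hypothetical illustration rather than a model from RLSO: the class and function names, the buy-low/sell-high rule and its thresholds, and the assumption that the next price depends on the two most recent prices. Each field of the state is there because the contribution function, the policy, or the transition function needs it.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class State:
    inventory: float   # needed by the policy (cannot sell what we do not hold) and the transition
    price: float       # needed by the contribution function and the policy
    prev_price: float  # needed only by the transition function, to model the next price


def policy(state, buy_below=30.0, sell_above=50.0):
    """Hypothetical buy-low/sell-high rule: +1 buys one unit, -1 sells one unit."""
    if state.price < buy_below:
        return 1.0
    if state.price > sell_above and state.inventory >= 1.0:
        return -1.0
    return 0.0


def contribution(state, decision):
    """Revenue from selling (decision < 0) or cost of buying (decision > 0)."""
    return -state.price * decision


def transition(state, decision, noise):
    """Assumed dynamics: the next price depends on the last two prices plus noise."""
    next_price = 0.75 * state.price + 0.25 * state.prev_price + noise
    return State(state.inventory + decision, next_price, state.price)


def simulate(state, horizon, seed=0):
    """Run the policy forward; only the current state is ever consulted."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(horizon):
        decision = policy(state)
        total += contribution(state, decision)
        noise = rng.gauss(0.0, 5.0)  # exogenous information arriving after the decision
        state = transition(state, decision, noise)
    return state, total


final_state, total = simulate(State(0.0, 40.0, 40.0), horizon=20)
print(final_state, total)
```

Under these assumed dynamics, the previous price belongs in the state even though neither the policy nor the contribution function uses it, because the transition function needs it to model the price. If the next price depended only on the current price, prev_price would no longer be necessary and would drop out of the state.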
I find it useful to note that a state variable is information which may come in three flavors:
- Physical state variables R_t – This might be inventory, the location of a vehicle on a graph, the attributes of a person, machine or patient.
- Informational variables I_t – This is any information about quantities or parameters that are not included in R_t. Examples could be market prices, weather, or traffic conditions.
- Belief variables B_t – These are statistics (frequentist) or parameters of probability distributions (Bayesian) describing any quantities or parameters that we do not know perfectly. This could be used to describe how a market responds to price, the time that a shipment might arrive, the state of a patient or complex machine.
While the optimal control literature is the best I have seen in terms of defining state variables, I have yet to see a control paper that recognizes that a belief can be part of a state variable.
The POMDP (partially observable Markov decision process) literature creates a special dynamic program where the belief about a quantity can be a state, but this literature does not seem to recognize that you can have physical state variables and belief state variables that combine to form the state variable. A good example arises in clinical trials where you have a physical state (how many patients are remaining in the pool, how much money you have remaining) and the belief about the efficacy of the drug.
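The clinical trial example can be sketched in Python as a single state that carries both physical and belief components. This is a minimal sketch under stated assumptions: the names, the per-patient cost, and the choice of a Beta belief about the response probability (updated from success/failure outcomes) are illustrative and are not taken from any particular trial model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrialState:
    patients_remaining: int  # physical state
    budget: float            # physical state
    alpha: int               # belief state: Beta(alpha, beta) over drug efficacy
    beta: int

    @property
    def mean_efficacy(self):
        """Posterior mean of the response probability under the Beta belief."""
        return self.alpha / (self.alpha + self.beta)


def can_enroll(state, cost_per_patient):
    """The constraint uses the physical part of the state."""
    return state.patients_remaining > 0 and state.budget >= cost_per_patient


def transition(state, responded, cost_per_patient):
    """Enroll one patient, pay for it, and update the belief with the outcome."""
    return TrialState(
        patients_remaining=state.patients_remaining - 1,
        budget=state.budget - cost_per_patient,
        alpha=state.alpha + (1 if responded else 0),
        beta=state.beta + (0 if responded else 1),
    )


# Hypothetical starting point: 100 patients, a budget of 50,000, uniform Beta(1, 1) belief.
s0 = TrialState(patients_remaining=100, budget=50000.0, alpha=1, beta=1)
s1 = transition(s0, responded=True, cost_per_patient=500.0)
s2 = transition(s1, responded=False, cost_per_patient=500.0)
print(s2, s2.mean_efficacy)
```

A decision to continue or stop the trial would typically look at all four fields at once: the physical fields determine whether another patient can be enrolled, and the belief fields determine whether it is worth doing.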