Reinforcement learning methods are a class of algorithms used to learn from an agent's experience in an episodic setting. Much like how we learn from experience, an RL algorithm learns by sequentially executing actions, observing the outcomes of those actions, and updating its behavior so as to achieve better outcomes in the future. This approach is distinct from traditional supervised learning, in which we attempt to model some unobserved probability distribution from a finite set of observations. In a certain sense, RL algorithms generate their own labels through experimentation. Because labels are generated only through simulation, though, two characteristics of RL methods become evident: (i) sample inefficiency: a single observation may require a very large number of actions, and (ii) reward-allocation inefficiency: although a sequence of chess moves may have produced a win, the win says nothing about the relative quality of each individual move; thus, we cannot avoid reinforcing bad actions and penalizing good ones. The former can be viewed as a problem of sparse rewards, while the latter introduces noise into the signal we use to optimize our learning model. Despite these drawbacks, RL methods are unique in that they enable us to optimize an agent to perform complex tasks of our choosing, such as playing chess, walking, or playing video games, rather than just mapping inputs to outputs in a dataset. Within RL there are a few different quantities we can try to learn, and many modern RL methods, such as the actor-critic approach, combine them. Here I'll discuss some RL basics that I've used in developing a super-human checkers bot.

Episodes as Markov Chains

In reinforcement learning we are interested in maximizing the expected reward of an agent acting in some environment. We represent an episode as a sequence of state, action, reward, next-state tuples, otherwise known as a trajectory, \( \tau \sim (s_{0}, a_{0}, r_{0}, s_{1}, ..., s_{T}) \). The function, or set of rules, governing the agent's choice of action in each state is referred to as the agent's \( \textit{policy} \), which is formally defined as a probability distribution over actions given a state, \( \pi(a \vert s) \). Think of an agent's policy as a representation of its decision-making strategy, expressed as a conditional probability distribution. In using this definition, we implicitly restrict ourselves to processes that obey the Markov property, which holds for any system in which the distribution over next states and rewards depends only on the current state and action, not on prior ones. To those unfamiliar this might seem restrictive, but surprisingly, many problems ranging from board games to robotics can be cast as an MDP (Markov decision process) or POMDP (partially observable MDP). This property allows us to use the chain rule to factor the joint probability distribution over the trajectory, \( p(s_{0}, a_{0}, r_{0}, s_{1}, ..., s_{T}) \), into a product of conditionals:
\[
p(s_{0}, a_{0}, r_{0}, s_{1}, ..., s_{T}) = p(s_{0})\prod_{t=0}^{T-1} \pi(a_{t} \vert s_{t}) p(r_{t}, s_{t+1} \vert a_{t}, s_{t}).
\tag{Eq. 1}\label{Eq. 1}
\]
These probabilistic factors underpin nearly all of reinforcement learning.

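To make the factorization concrete, here is a minimal sketch of rolling out a trajectory of \( (s_{t}, a_{t}, r_{t}, s_{t+1}) \) tuples by alternately sampling from the policy and the transition kernel. The two-state MDP, the `TRANSITION` table, and the function names are all my own invention for illustration:

```python
import random

# A made-up two-state MDP: from state 0, action 0 stays put with reward 0,
# action 1 moves to the terminal state 1 with reward 1.
# TRANSITION[s][a] -> (r, s_next) plays the role of p(r, s' | a, s)
# (deterministic here to keep the sketch short).
TRANSITION = {0: {0: (0.0, 0), 1: (1.0, 1)}}
TERMINAL = 1

def policy(s):
    """A fixed stochastic policy pi(a|s): each action with probability 0.5."""
    return random.randint(0, 1)

def sample_trajectory(max_steps=100):
    """Roll out (s_t, a_t, r_t, s_{t+1}) tuples until the terminal state."""
    s, tau = 0, []
    for _ in range(max_steps):
        a = policy(s)
        r, s_next = TRANSITION[s][a]
        tau.append((s, a, r, s_next))
        s = s_next
        if s == TERMINAL:
            break
    return tau
```

Each tuple appended to `tau` corresponds to one \( \pi(a_{t} \vert s_{t}) p(r_{t}, s_{t+1} \vert a_{t}, s_{t}) \) factor in (1).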
Reward assignment

The objective function to optimize is the cumulative, discounted, future reward. The relationship between rewards at successive time steps lends itself to a recursive definition of the return at time-step \( t \):
\[
\begin{split}
R_{t} &= r_{t} + \gamma r_{t+1} + \gamma^{2}r_{t+2} + ... \\
&= r_{t} + \gamma \big( r_{t+1} + \gamma r_{t+2} + ... \big) \\
&= r_{t} + \gamma R_{t+1} \\
&= \sum_{k=0}^{\infty}\gamma^{k} r_{t+k}
\end{split} \tag{Eq. 2}\label{Eq. 2}
\]
This definition is extensible to both finite- and infinite-length episodes. For example, if we wish to learn an optimal control policy for the HVAC system in a large building to minimize energy consumption, an episode has no natural end, so we must enforce the bound \( 0 < \gamma < 1 \) for the sum to converge. In a board game with a finite number of steps, we can relax that bound slightly, \( 0 < \gamma \leq 1 \), and then simply assign \( R_{t>T} := 0 \), where \( T \) is the terminal step, without loss of generality.
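The recursive form in (2) lets us compute the return at every time step in a single backward pass over an episode's rewards, rather than re-summing the tail for each \( t \). A minimal sketch (the function name is my own; `rewards[t]` is taken to be the reward logged at step \( t \) of the trajectory):

```python
def discounted_returns(rewards, gamma):
    """Backward pass computing R_t = rewards[t] + gamma * R_{t+1},
    with the return past the terminal step defined as zero."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns
```

With \( \gamma = 1 \) this reduces to the tail sums of the reward sequence.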

State value function

One example of where the factorization in (1) becomes useful is the state value function, \( v(s) \), which is defined as the expected reward when following policy \( \pi(a \vert s) \) from state \( \textit{s} \). Using the factorization in (1) along with the total discounted rewards emanating from timestep \( \textit{t} \) in expression (2), we can write the value function as:
\[
\begin{split}
v(s_{t}) &= E_{\tau\sim\pi} \big[R_{t} \vert s_{t}\big] \\
&= E_{\tau\sim\pi} \Big[\sum_{k=0}^{T} \gamma^{k}r_{t+k} \ \big| \ s_{t} \Big] \\
&= E_{\tau\sim\pi} \Big[r_{t} + \gamma R_{t+1} \ \big| \ s_{t} \Big] \\
&= \sum_{a_{t}} \pi(a_{t}|s_{t}) \sum_{s_{t+1}} \sum_{r_{t}}p(r_{t},s_{t+1}|a_{t},s_{t}) \Big[r_{t} + \gamma v(s_{t+1}) \Big]
\end{split} \tag{Eq. 3}\label{Eq. 3}
\]
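The last line of (3) can be turned directly into iterative policy evaluation when the MDP is small enough to enumerate. Below is a sketch on a made-up three-state chain (the states, dynamics, and function name are mine, chosen so the fixed point is easy to verify by hand): each non-terminal state has a single action that moves one step right with reward 1, so the sums over actions, next states, and rewards collapse to a single term.

```python
def evaluate_policy(gamma=0.9, sweeps=50):
    """Iterative policy evaluation on a 3-state chain: 0 -> 1 -> 2 (terminal).
    Each sweep applies the Bellman backup v(s) = r + gamma * v(s')."""
    v = {0: 0.0, 1: 0.0, 2: 0.0}  # the terminal state keeps value 0
    for _ in range(sweeps):
        for s in (0, 1):
            r, s_next = 1.0, s + 1  # deterministic policy and dynamics
            v[s] = r + gamma * v[s_next]
    return v
```

The fixed point is \( v(1) = 1 \) and \( v(0) = 1 + \gamma \), which the sweeps converge to after two passes here.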

State-action value function

We similarly use this factorization in Q-learning, where we define the state-action value function, \( q(s,a) \), as the expected reward when starting in state \( \textit{s} \), taking action \( \textit{a} \), and acting greedily thereafter. Factorization yields:
\[
\begin{split}
q_{t}(s,a) &= E_{\tau\sim\pi} \big[R_{t} \vert a,s\big] \\
&= \sum_{s_{t+1}} \sum_{r_{t}}p(r_{t},s_{t+1}|a_{t},s_{t}) \Big[r_{t} + \gamma \max_{a_{t+1}} q(s_{t+1},a_{t+1}) \Big].
\end{split} \tag{Eq. 4}\label{Eq. 4}
\]
It is worth noting that the state value and Q functions can be learned with dynamic programming if the space of states, actions, and rewards can be enumerated. This amounts to tabulating values. For most interesting problems, though, the state space is prohibitively large and we must parameterize our value function or policy and optimize some approximating function, such as a neural network. A host of algorithms have been developed to learn these functions [cf. Sutton and Barto, 1998; Mnih et al., 2013].
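In the tabular case, (4) is typically learned from samples rather than by enumerating the sums: each observed transition \( (s, a, r, s') \) nudges \( q(s,a) \) toward the sampled Bellman target. A sketch of that update (the function name and the dict-as-table representation are mine):

```python
def q_learning_update(q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9):
    """One tabular Q-learning step: move q(s, a) toward the sampled
    target r + gamma * max_a' q(s', a'). Unseen entries default to 0."""
    old = q.get((s, a), 0.0)
    target = r + gamma * max(q.get((s_next, a2), 0.0) for a2 in actions)
    q[(s, a)] = old + alpha * (target - old)
    return q[(s, a)]
```

The step size \( \alpha \) controls how far each sample pulls the estimate; repeated over many transitions, the table converges toward the fixed point of (4) under the usual conditions.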

Policy gradient methods

Policy gradient methods are concerned with directly learning the agent's policy: we optimize the policy itself rather than a value function over states or state-action pairs. We start with the following optimization problem:
\[
\underset{\theta}{max} \ E_{\tau\sim\pi}\big[R_{\tau} \vert \pi_{\theta} \big] \tag{Eq. 5}\label{Eq. 5}
\]
where \( \pi_{\theta}:=\pi(a \vert s, \theta) \) is a policy parameterized by the parameter vector \( \theta \). More often than not, \( \theta \) will represent the set of trainable parameters in a neural network, which we'll optimize using stochastic gradient descent; however, \( \theta \) could be any parameterized representation of the policy, optimized using gradient-based (e.g., SGD) or derivative-free (e.g., MCMC sampling) methods. Here we'll derive the widely used gradient update rule. We are interested in the quantity \( \nabla_{\theta}E_{\tau\sim\pi}\big[R_{\tau} \vert \pi_{\theta} \big] \). Expanding this, we can use the identity for the gradient of the log of a function (expression 6) to yield the relationship in (7), which transforms the gradient of an expectation into the expectation of the gradient of a log probability. Expression (7) is known as the score function gradient estimator (SFGE).
\[
\begin{split}
\nabla_{x} \log f(x) = \frac{\nabla_{x} f(x)}{f(x)}
\end{split} \tag{Eq. 6}\label{Eq. 6}
\]
\[
\begin{split}
\nabla_{\theta}E_{\tau\sim\pi}\big[R_{\tau} \vert \pi_{\theta} \big] &= \int_{\tau} \nabla_{\theta} p(\tau \vert \theta)R_{\tau} d\tau \\
&= \int_{\tau} p(\tau \vert \theta) \frac{\nabla_{\theta} p(\tau \vert \theta)}{p(\tau \vert \theta)}R_{\tau} d\tau \\
&= \int_{\tau} p(\tau \vert \theta) \nabla_{\theta} \log \big(p(\tau \vert \theta)\big)R_{\tau} d\tau \\
&= E_{\tau\sim\pi}\big[R_{\tau}\nabla_{\theta} \log p(\tau \vert \theta) \vert \pi_{\theta} \big]
\end{split} \tag{Eq. 7}\label{Eq. 7}
\]
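One way to convince yourself of (7) is numerically: for a distribution with a known expectation, the Monte Carlo average of \( f(x)\nabla_{\theta} \log p(x \vert \theta) \) should match the true gradient of \( E[f(x)] \). A toy check, with a Bernoulli(\( \theta \)) standing in for the trajectory distribution and a linear \( f \) standing in for the reward (the function name and example are mine):

```python
import random

def sfge_estimate(theta=0.3, n=200_000, seed=1):
    """Monte Carlo score-function estimate of d/dtheta E[f(x)] for
    x ~ Bernoulli(theta) and f(x) = 3x + 1.
    The score is d/dtheta log p(x|theta) = x/theta - (1-x)/(1-theta);
    the exact answer is d/dtheta (3*theta + 1) = 3."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = 1.0 if rng.random() < theta else 0.0
        score = x / theta - (1.0 - x) / (1.0 - theta)
        total += (3.0 * x + 1.0) * score
    return total / n
```

With enough samples the estimate lands near the exact gradient of 3, despite never differentiating \( f \) itself.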
Within (7), we can now expand \( p(\tau \vert \theta) \) using the factorization from (1) to yield:
\[
\begin{split}
\nabla_{\theta} \log p(\tau \vert \theta) &= \nabla_{\theta} \log \Bigg[ p(s_{0})\prod_{t=0}^{T-1} \pi(a_{t} \vert s_{t}, \theta) p(r_{t}, s_{t+1} \vert a_{t}, s_{t}) \Bigg] \\
&= \nabla_{\theta} \bigg[ \log p(s_{0}) + \sum_{t=0}^{T-1} \big[ \log \pi(a_{t} \vert s_{t}, \theta) + \log p(r_{t}, s_{t+1} \vert a_{t}, s_{t}) \big] \bigg] \\
&= \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_{t} \vert s_{t}, \theta)
\end{split} \tag{Eq. 8}\label{Eq. 8}
\]
The gradient of our original expected reward is therefore given by:
\[
\begin{split}
\nabla_{\theta} E_{\tau\sim\pi}\big[R_{\tau} \vert \pi_{\theta} \big] &= E_{\pi} \Big[R_{\tau} \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_{t} \vert s_{t}, \theta) \Big] \\
&= E_{\pi} \Big[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_{t} \vert s_{t}, \theta) \sum_{i=t}^{T-1}\gamma^{i-t}r_{i}\Big]
\end{split} \tag{Eq. 9}\label{Eq. 9}
\]
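Expression (9) can be implemented in a few lines for a policy simple enough to differentiate by hand. Below is a sketch for a stateless two-action softmax policy, a bandit-style simplification in which the policy ignores the state; the function names, the episode format, and the choice of softmax parameterization are all mine. For a softmax policy \( \pi(i) = e^{\theta_{i}} / \sum_{j} e^{\theta_{j}} \), the score \( \nabla_{\theta} \log \pi(a) \) has the closed form "one-hot of \( a \) minus the probability vector".

```python
import math

def grad_log_softmax(theta, a):
    """Closed-form grad of log pi(a) for a softmax policy over theta:
    e_a - pi, where e_a is the one-hot vector for the chosen action."""
    z = [math.exp(t) for t in theta]
    total = sum(z)
    return [(1.0 if i == a else 0.0) - z[i] / total for i in range(len(theta))]

def policy_gradient(episodes, theta, gamma=0.99):
    """Vanilla PG estimate: average over episodes of grad log pi(a_t)
    weighted by the discounted reward-to-go from step t.
    Each episode is an (actions, rewards) pair of equal-length lists."""
    grad = [0.0] * len(theta)
    for actions, rewards in episodes:
        for t, a in enumerate(actions):
            g = sum(gamma ** (i - t) * rewards[i] for i in range(t, len(rewards)))
            for j, gj in enumerate(grad_log_softmax(theta, a)):
                grad[j] += gj * g
    return [x / len(episodes) for x in grad]
```

A gradient ascent step would then be \( \theta \leftarrow \theta + \eta \nabla_{\theta} \) for some learning rate \( \eta \).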
Expression (9) is the vanilla policy gradient update rule. In practice, we execute a set of trajectories, logging state-action pairs along with the rewards. We then weight the gradients in our model by the discounted reward-to-go, simple as that. This result is quite remarkable, yet intuitive. One of the issues with policy gradients is that they are very noisy. One method to reduce the variance is to use the \( \textit{advantage} \) function in lieu of the raw rewards, which simply subtracts the expected reward at each state (i.e., the state value function in expression (3)) from the discounted reward-to-go in expression (9):
\[
\begin{split}
\nabla_{\theta} E_{\tau\sim\pi}\big[R_{\tau} \vert \pi_{\theta} \big] &= E_{\pi} \Big[R_{\tau} \sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_{t} \vert s_{t}, \theta) \Big] \\
&= E_{\pi} \Big[\sum_{t=0}^{T-1} \nabla_{\theta} \log \pi(a_{t} \vert s_{t}, \theta) \Big[ \sum_{i=t}^{T-1}\gamma^{i-t}r_{i} - v(s_{t}) \Big] \Big]
\end{split} \tag{Eq. 10}\label{Eq. 10}
\]
The intuition behind the advantage function is this: instead of assigning credit on the basis of a single observed rollout, why not compute the delta between that observation and the expected value of that state, estimated from all previous observations? This is the idea behind actor-critic methods. The problem being addressed is policy myopia; the policy doesn't provide a global representation of the quality of actions, only the relative quality of the actions conditioned on the current state. To provide a way to nudge our short-sighted policy towards actions that are globally good, we can introduce a function (the critic) that keeps track of global state values, and then use its information in the gradient update of our policy. If you're familiar with the scheduled sampling techniques used in sequence modeling, the advantage function plays a similar role here: it reduces the variance of a noisy gradient estimator, enabling the model to converge faster.
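As a concrete instance of the weighting term in (10), the advantage at each step is the discounted reward-to-go minus the critic's value estimate. A minimal sketch (the function name is mine; `values[t]` stands for the critic's estimate of \( v(s_{t}) \), however it was obtained):

```python
def advantages(rewards, values, gamma=0.99):
    """Per-step advantage: discounted reward-to-go minus the baseline
    v(s_t). Subtracting a state-dependent baseline shifts each weight
    without biasing the gradient estimate, while shrinking its variance."""
    T = len(rewards)
    adv = []
    for t in range(T):
        g = sum(gamma ** (i - t) * rewards[i] for i in range(t, T))
        adv.append(g - values[t])
    return adv
```

In an actor-critic loop, these advantages would replace the raw reward-to-go weights in the policy gradient of (9).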