The Policy Gradient Theorem is a significant result in the field of reinforcement learning, specifically for policy gradient methods. The theorem furnishes us with an equation to compute the gradient of the expected cumulative reward with respect to the policy parameters. These gradients are then employed to refine the policy, incrementally adjusting it to maximize the expected return.
The theorem formally states that:
$$ \nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[\nabla_{\theta} \log \pi(a_t|s_t;\theta) \, Q^{\pi}(s_t, a_t)\right] $$
In this equation, \( \nabla_{\theta} J(\theta) \) represents the gradient of the expected return \( J(\theta) \) with respect to the policy parameters \( \theta \). \( \mathbb{E}_{\pi}[...] \) signifies taking an expectation under policy \( \pi \), which in practice means averaging over many trajectories generated by following policy \( \pi \).
The term \( \nabla_{\theta} \log \pi(a_t|s_t;\theta) \) is the gradient of the log-probability of selecting action \( a_t \) at state \( s_t \) given the policy \( \pi \) parameterized by \( \theta \). It shows how much the log-probability of the action changes as we slightly modify the policy parameters.
The term \( Q^{\pi}(s_t, a_t) \) is the action-value function under policy \( \pi \). It gives the expected return obtained by starting in state \( s_t \), taking action \( a_t \), and following policy \( \pi \) thereafter; in other words, it measures how good it is to take a particular action in a particular state under policy \( \pi \).
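To make the \( \nabla_{\theta} \log \pi(a_t|s_t;\theta) \) term concrete, here is a minimal sketch assuming a tabular softmax policy with one preference parameter per state-action pair (the function names and the tiny 3-state, 2-action setup are purely illustrative, not part of the theorem). For a softmax policy, the gradient of the log-probability with respect to the preferences of state \( s \) is simply one-hot(\( a \)) minus the action probabilities.

```python
import numpy as np

def softmax_policy(theta, s):
    """Action probabilities pi(.|s; theta) for a tabular softmax policy."""
    prefs = theta[s] - theta[s].max()          # subtract max for numerical stability
    probs = np.exp(prefs) / np.exp(prefs).sum()
    return probs

def grad_log_pi(theta, s, a):
    """Gradient of log pi(a|s; theta) with respect to theta.

    For a tabular softmax policy, only row s is nonzero and it equals
    onehot(a) - pi(.|s; theta).
    """
    grad = np.zeros_like(theta)
    grad[s] = -softmax_policy(theta, s)
    grad[s, a] += 1.0
    return grad

# Example: 3 states, 2 actions, all preferences zero (uniform policy)
theta = np.zeros((3, 2))
print(grad_log_pi(theta, s=1, a=0))   # only row s=1 is nonzero: [0.5, -0.5]
```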
The policy gradient algorithm is a type of reinforcement learning method whose goal is to optimize the policy directly, without needing an intermediate value function. Value-based methods learn the value of each state or action and derive the policy from those estimates; policy gradient methods instead adjust the policy parameters themselves.
In a policy gradient method, the policy, denoted as \( \pi(a|s;\theta) \), is parameterized by \( \theta \). The reinforcement learning objective is to find the policy that maximizes the expected return from each state, represented as:
$$
J(\theta) = \mathbb{E}_{\pi}[R_t|s=s_0, a=\pi(s;\theta)]
$$
Here, \( R_t \) is the return, the discounted cumulative future reward from time step \( t \): \( R_t = \sum_{k=0}^{\infty} \gamma^{k} r_{t+k+1} \), where \( \gamma \in [0, 1) \) is the discount factor and \( r_{t+k+1} \) are the rewards received after time \( t \).
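As a concrete illustration, the following sketch computes the return for every step of an episode from a list of per-step rewards, assuming a given discount factor; the function and variable names are illustrative.

```python
def returns(rewards, gamma=0.99):
    """Compute the discounted return G_t = r_t + gamma*r_{t+1} + ... for every t."""
    G = 0.0
    out = []
    for r in reversed(rewards):        # accumulate from the end of the episode
        G = r + gamma * G
        out.append(G)
    return list(reversed(out))

print(returns([1.0, 0.0, 2.0], gamma=0.9))  # approximately [2.62, 1.8, 2.0]
```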
The optimization problem here is to find the \( \theta \) that maximizes \( J(\theta) \), and this can be achieved using gradient ascent. In each step, the policy parameters are updated using the rule:
$$
\theta \leftarrow \theta + \alpha \nabla_{\theta} J(\theta)
$$
This equation says that the policy parameters \( \theta \) should be nudged in the direction that increases the expected return, scaled by a learning rate \( \alpha \).
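As a minimal sketch of this update rule (assuming we already have a gradient estimate; `grad_J` below is just a random placeholder for illustration), one gradient-ascent step looks like:

```python
import numpy as np

theta = np.zeros((3, 2))            # policy parameters
grad_J = np.random.randn(3, 2)      # placeholder for an estimated policy gradient
alpha = 0.01                        # learning rate

theta = theta + alpha * grad_J      # nudge theta toward higher expected return
```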
REINFORCE is a particular policy gradient method introduced by Ronald Williams in 1992. The name is an acronym for "REward Increment = Nonnegative Factor x Offset Reinforcement x Characteristic Eligibility". In REINFORCE, the gradient of the expected return is estimated from a complete episode, and the policy is updated at the end of the episode. The gradient estimate can be summarized as:
$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{\pi}\left[\nabla_{\theta} \log \pi(a_t|s_t;\theta) \, G_t\right]
$$
Here, \( G_t \) is the return from time step \( t \), and \( \nabla_{\theta} \log \pi(a_t|s_t;\theta) \) is the gradient of the log-probability of taking action \( a_t \) in state \( s_t \) under policy \( \pi \). In practice, the expectation is replaced by the returns sampled along the episode, giving the per-step update \( \theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi(a_t|s_t;\theta) \, G_t \). Intuitively, actions followed by large returns have their probabilities increased more strongly than actions followed by small returns.
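Below is a self-contained sketch of REINFORCE on a deliberately tiny toy task: a single-state episodic problem with two actions, where action 0 always pays reward 1 and action 1 pays 0. The environment, hyperparameters, and variable names are illustrative assumptions chosen to keep the example short; the structure (sample an episode, compute the returns, then apply the \( \nabla_{\theta} \log \pi \cdot G_t \) update) follows the REINFORCE procedure described above.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99          # discount factor
alpha = 0.1           # learning rate
horizon = 10          # episode length
theta = np.zeros(2)   # softmax preferences for the two actions

def policy(theta):
    """Action probabilities for a single-state softmax policy."""
    prefs = theta - theta.max()
    return np.exp(prefs) / np.exp(prefs).sum()

for episode in range(500):
    # Sample one episode under the current policy.
    actions, rewards = [], []
    for t in range(horizon):
        a = rng.choice(2, p=policy(theta))
        actions.append(a)
        rewards.append(1.0 if a == 0 else 0.0)   # toy reward: action 0 pays 1

    # Compute the return G_t for every time step.
    G, ep_returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        ep_returns.append(G)
    ep_returns.reverse()

    # REINFORCE update: theta += alpha * G_t * grad log pi(a_t | theta).
    for a, G_t in zip(actions, ep_returns):
        grad_log = -policy(theta)     # softmax gradient: onehot(a) - probs
        grad_log[a] += 1.0
        theta += alpha * G_t * grad_log

print(policy(theta))  # probability of action 0 should approach 1
```

In practice, a baseline (often a learned state-value estimate) is usually subtracted from \( G_t \) to reduce the variance of this gradient estimate without changing its expectation.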
Let's try to simplify these complex terms and ideas. Imagine you're trying to teach a robot to make perfect pancakes. The Policy Gradient Theorem is like a magic recipe that tells the robot how to change its pancake-making process (the policy) to make better pancakes (maximize the return, which here is pancake quality). The recipe tells the robot to look at how good each action (like flipping the pancake) was for the final pancake quality, and how much the quality changes when that action is tweaked slightly. It then suggests changing the pancake-making process in the direction that improves the pancakes the most.

The policy gradient algorithm is the robot's strategy for improving its pancake-making process. The robot tries different actions in different situations (states), observes how good the pancakes turn out (the return), and then updates its process (the policy) to do more of what worked well and less of what didn't.

REINFORCE is a particular strategy in which the robot waits until it finishes a whole batch of pancakes (an episode) and then reflects on how to improve. It looks back at each step (mixing the batter, pouring it on the pan, flipping the pancake, and so on) and at how good the final pancakes were. Steps that led to good pancakes are repeated more often in the future; steps that led to bad pancakes are used less.