Policy gradient methods are a standard family of techniques in reinforcement learning for training agents to take actions that maximize cumulative reward. In this article, we'll implement REINFORCE, one of the simplest policy gradient algorithms, in PyTorch, using CartPole as the example task.
Understanding Policy Gradients
Policy gradients involve learning a parameterized policy function, typically represented by a neural network, which directly maps observations to actions. The policy improves as the agent interacts with the environment and receives rewards. In essence, policy gradients adjust these parameters to increase the probability of actions that lead to higher rewards.
The REINFORCE Algorithm
REINFORCE is a simple yet effective policy gradient method. It runs complete episodes (sequences of states, actions, and rewards), records each trajectory, and then uses that data to update the policy parameters via gradient ascent. The gradient it estimates is:
∇θ J(θ) ≈ E[∇θ log πθ(a|s) * R]
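Here, θ denotes the policy parameters, J(θ) is the expected return, a is the action, s is the state, πθ(a|s) is the probability the policy assigns to action a in state s, and R is the total reward collected over the trajectory. In practice the expectation is approximated by averaging over sampled episodes. The implementation below weights each action by the discounted return from that time step onward rather than by the whole-trajectory total, a common variant that reduces variance.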
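To make the gradient estimate concrete, here is a minimal sketch in plain Python (no PyTorch) for a hypothetical stateless policy: a softmax over two logits, so θ is simply the list of logits. The helper names softmax, grad_log_prob, and reinforce_step, as well as the learning rate, are illustrative assumptions and not part of any library. For a softmax policy, the gradient of log πθ(a) with respect to the logits is one_hot(a) minus πθ.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(logits, action):
    """Gradient of log pi(action) w.r.t. the logits: one_hot(action) - pi."""
    probs = softmax(logits)
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]


def reinforce_step(logits, action, total_reward, lr=0.1):
    """One gradient-ascent step: theta <- theta + lr * R * grad log pi(a)."""
    grad = grad_log_prob(logits, action)
    return [z + lr * total_reward * g for z, g in zip(logits, grad)]


theta = [0.0, 0.0]                 # both actions start equally likely
before = softmax(theta)[1]
# Suppose action 1 was sampled and the episode returned a total reward of 2.0
theta = reinforce_step(theta, action=1, total_reward=2.0)
after = softmax(theta)[1]
print(before, after)               # the probability of action 1 goes up
```

A positive reward pushes probability toward the sampled action, and a negative reward would push it away, which is the whole mechanism behind the update.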
Implementing REINFORCE with PyTorch
Let's dive into the implementation, assuming you have PyTorch installed.
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.relu and F.softmax below
import torch.optim as optim

# Define the policy network
class PolicyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PolicyNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        # Softmax turns the output scores into action probabilities
        return F.softmax(self.fc2(x), dim=-1)

policy_net = PolicyNet(input_size=4, hidden_size=128, output_size=2)
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)
The above code sets up a simple two-layer network suitable for a task like CartPole. It maps a 4-dimensional observation to a probability distribution over two possible actions.
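As a quick sanity check, the sketch below runs one forward pass and samples an action. The network definition is repeated so the snippet runs on its own; the all-zeros observation and the fixed seed are placeholders chosen for illustration, not real CartPole data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)


torch.manual_seed(0)  # fixed seed so the run is repeatable
net = PolicyNet(input_size=4, hidden_size=128, output_size=2)

obs = torch.zeros(1, 4)                        # placeholder observation, batch of 1
probs = net(obs)                               # shape (1, 2), each row sums to 1
dist = torch.distributions.Categorical(probs)
action = dist.sample()                         # tensor holding 0 or 1
log_prob = dist.log_prob(action)               # differentiable w.r.t. the weights
print(probs.shape, action.item(), log_prob.item())
```

The log probability stays attached to the computation graph, which is what later lets the loss backpropagate into the network weights.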
Collecting Trajectories
To use REINFORCE, we need to collect trajectories. Here is how we might run a simulation to do so:
import gym

# Note: this uses the classic Gym API (gym < 0.26). Newer Gym and Gymnasium
# releases return (observation, info) from reset() and five values from step().
env = gym.make('CartPole-v1')
state = env.reset()
log_probs = []
rewards = []

for _ in range(1000):  # run for at most 1000 time steps
    state = torch.from_numpy(state).float().unsqueeze(0)  # add a batch dimension
    probs = policy_net(state)
    m = torch.distributions.Categorical(probs)
    action = m.sample()
    log_probs.append(m.log_prob(action))  # kept for the gradient computation
    state, reward, done, _ = env.step(action.item())
    rewards.append(reward)
    if done:
        break

This loop samples one action per time step and stores the reward and the log probability of the chosen action, stopping when the episode ends.
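The loop above is written against the classic Gym API. In Gym 0.26 and later, and in Gymnasium, reset() returns an (observation, info) pair and step() returns five values (observation, reward, terminated, truncated, info). One possible way to stay compatible with both conventions is a pair of small adapters; the names unpack_reset and unpack_step are hypothetical helpers for this sketch, not functions provided by Gym.

```python
def unpack_reset(result):
    """Return the observation from either reset() convention."""
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]  # newer API: (observation, info)
    return result         # classic API: observation only


def unpack_step(result):
    """Return (observation, reward, done) from either step() convention."""
    if len(result) == 5:
        # newer API: observation, reward, terminated, truncated, info
        obs, reward, terminated, truncated, _info = result
        return obs, reward, terminated or truncated
    # classic API: observation, reward, done, info
    obs, reward, done, _info = result
    return obs, reward, done
```

With these in place, the collection loop would call state = unpack_reset(env.reset()) and state, reward, done = unpack_step(env.step(action.item())). Treating truncation as the end of the episode is a simplifying assumption that suits this basic REINFORCE setup.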
Updating the Policy
Once trajectories are collected, we can proceed with the policy update:
R = 0
policy_loss = []
returns = []

# Calculate discounted returns, working backwards from the end of the episode
for r in rewards[::-1]:
    R = r + 0.99 * R  # 0.99 is the discount factor
    returns.insert(0, R)

# Normalize the returns to reduce the variance of the gradient estimate
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-5)

# Calculate loss: minimizing -log_prob * return is gradient ascent on J
for log_prob, R in zip(log_probs, returns):
    policy_loss.append(-log_prob * R)

optimizer.zero_grad()
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()

This section computes and normalizes the discounted returns, builds the policy loss from the negative log probabilities weighted by those returns, and takes one optimizer step. Together with trajectory collection, that is one full REINFORCE update; training repeats both steps over many episodes.
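The return calculation is easier to follow on a tiny example. The sketch below reproduces it in plain Python; the function names discounted_returns and normalize are illustrative, and a discount factor of 0.5 is used only to keep the arithmetic easy to check by hand. Leaving a single-step episode unnormalized is an assumption of this sketch, made because the sample standard deviation is undefined for one value.

```python
import statistics


def discounted_returns(rewards, gamma=0.99):
    """Return the discounted reward-to-go for every time step."""
    returns = []
    running = 0.0
    for r in reversed(rewards):      # walk backwards from the last step
        running = r + gamma * running
        returns.append(running)
    returns.reverse()                # restore chronological order
    return returns


def normalize(values, eps=1e-5):
    """Shift to zero mean and scale by the sample standard deviation."""
    if len(values) < 2:
        return list(values)          # assumption: skip normalization
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / (std + eps) for v in values]


rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards, gamma=0.5)
print(returns)                       # [1.75, 1.5, 1.0]
normalized = normalize(returns)
print(normalized)                    # earlier steps get larger weights
```

Appending and reversing once avoids the repeated insert(0, ...) calls, which cost time proportional to the list length on every step; for short episodes either approach works.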
Conclusion
In this article, we implemented the REINFORCE algorithm in PyTorch: a policy network, a loop that collects one episode of experience, and an update step that raises the probability of actions followed by high returns. REINFORCE is only a starting point, but it lays the foundation for more advanced techniques like Actor-Critic and Proximal Policy Optimization.