Mastering Policy Gradients Using PyTorch and REINFORCE

Policy gradients are one of the standard techniques in reinforcement learning for training agents to take actions that maximize cumulative rewards. In this article, we'll focus on implementing policy gradients using PyTorch and the REINFORCE algorithm. This combination is powerful for creating agents that learn from their environment effectively.

Understanding Policy Gradients
The REINFORCE Algorithm
Implementing REINFORCE with PyTorch
Collecting Trajectories
Updating the Policy
Conclusion

Understanding Policy Gradients

Policy gradients involve learning a parameterized policy function, typically represented by a neural network, which directly maps observations to actions. The policy improves as the agent interacts with the environment and receives rewards. In essence, policy gradients adjust these parameters to increase the probability of actions that lead to higher rewards.
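To make the idea of "increasing the probability of rewarded actions" concrete, here is a minimal sketch. It assumes a toy policy whose only parameters are two logits, and the names `logits`, `action`, `reward`, and `lr` are illustrative choices rather than part of any library API. A single gradient step on the negative log probability of a rewarded action nudges the policy toward that action.

```python
import torch

# Toy policy: two logits are the only parameters (an illustrative assumption).
logits = torch.zeros(2, requires_grad=True)
action, reward, lr = 0, 1.0, 0.1  # hypothetical sampled action, its reward, step size

probs_before = torch.softmax(logits, dim=-1)

# Minimizing -log pi(a) * reward is gradient ascent on log pi(a) * reward.
loss = -torch.log(probs_before[action]) * reward
loss.backward()

# Plain gradient step on the parameters.
with torch.no_grad():
    logits -= lr * logits.grad

probs_after = torch.softmax(logits, dim=-1)
p_before = probs_before[action].item()
p_after = probs_after[action].item()
print(p_before, p_after)
```

With a positive reward, the probability of the chosen action should be higher after the step than before it; a negative reward would push it the other way.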

The REINFORCE Algorithm

REINFORCE is a simple yet effective policy gradient method. It runs complete episodes (sequences of states, actions, and rewards), collects the resulting trajectory data, and then uses that data to update the policy parameters via gradient ascent. The gradient that drives the update is estimated as:

 ∇θ J(θ) ≈ E[∇θ log πθ(a|s) * R]

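Here, θ are the policy parameters, J(θ) is the expected return, a is the action, s is the state, πθ(a|s) is the probability the policy assigns to action a in state s, and R is the total reward gained from the trajectory. In practice, and in the code later in this article, R is usually replaced by the discounted return from each time step onward, which reduces the variance of the estimate.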
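For readers who want to see where this expression comes from, the following is a short sketch of the standard log-derivative argument. The trajectory symbol τ, the trajectory distribution p_θ(τ), and the horizon T are notation introduced here for illustration; they do not appear in the formula above.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta \int p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \int p_\theta(\tau)\, \nabla_\theta \log p_\theta(\tau)\, R(\tau)\, d\tau \\
  &= \mathbb{E}_{\tau \sim p_\theta}\!\left[ \sum_{t=0}^{T-1}
       \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, R(\tau) \right]
\end{align*}
```

The second line uses the identity ∇θ pθ(τ) = pθ(τ) ∇θ log pθ(τ). The last line follows because the environment dynamics do not depend on θ, so only the policy terms survive in ∇θ log pθ(τ). Averaging the bracketed quantity over sampled episodes gives the Monte Carlo estimate that REINFORCE uses.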

Implementing REINFORCE with PyTorch

Let's dive into the implementation, assuming you have PyTorch installed.

import torch
import torch.nn as nn
import torch.nn.functional as F  # provides the relu and softmax used in forward()
import torch.optim as optim

# Define the policy network
class PolicyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(PolicyNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

policy_net = PolicyNet(input_size=4, hidden_size=128, output_size=2)
optimizer = optim.Adam(policy_net.parameters(), lr=0.01)

The above code sets up a simple neural network suitable for a task like CartPole. It maps observations (4-dimensional input) to two possible actions.
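Before wiring the network to an environment, a quick sanity check can confirm that it produces a valid probability distribution. The sketch below repeats the network definition so that it runs on its own, and feeds it an all-zeros observation; `dummy_obs` is a stand-in for a real CartPole observation, not actual environment data.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super().__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.fc2 = nn.Linear(hidden_size, output_size)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        return F.softmax(self.fc2(x), dim=-1)

torch.manual_seed(0)  # fixed seed so the initial weights are reproducible
policy_net = PolicyNet(input_size=4, hidden_size=128, output_size=2)

# A batch containing one placeholder 4-dimensional observation.
dummy_obs = torch.zeros(1, 4)
with torch.no_grad():
    probs = policy_net(dummy_obs)

prob_shape = tuple(probs.shape)
prob_sum = probs.sum().item()
print(prob_shape, prob_sum)
```

The output should have one row with two entries that sum to 1 (up to floating-point error), one probability per action.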

Collecting Trajectories

To use REINFORCE, we need to collect trajectories. Here is how we might run a simulation to do so:

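import gym  # assumes gym >= 0.26; the gymnasium package exposes the same API

env = gym.make('CartPole-v1')
# reset() returns (observation, info); older gym releases return only the observation
state, _ = env.reset()

log_probs = []
rewards = []

for _ in range(1000):  # run for at most 1000 time steps
    state = torch.from_numpy(state).float().unsqueeze(0)
    probs = policy_net(state)
    m = torch.distributions.Categorical(probs)
    action = m.sample()
    log_probs.append(m.log_prob(action))
    # step() returns (observation, reward, terminated, truncated, info);
    # older gym releases return four values with a single done flag
    state, reward, terminated, truncated, _ = env.step(action.item())
    rewards.append(reward)
    if terminated or truncated:
        break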

This loop runs a single episode, sampling an action from the policy at each step and storing the reward and the log probability of each sampled action.
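The sampling step is easier to follow in isolation. The sketch below uses a hand-written probability vector in place of the network output (the values 0.25 and 0.75 are arbitrary) and shows that `log_prob` returns the log of the probability assigned to the sampled action.

```python
import math
import torch
from torch.distributions import Categorical

# Hand-written action probabilities standing in for policy_net(state).
probs = torch.tensor([[0.25, 0.75]])
dist = Categorical(probs)

torch.manual_seed(0)
action = dist.sample()            # shape (1,), matching the batch dimension
log_prob = dist.log_prob(action)  # shape (1,), still attached to probs' graph if probs requires grad

# log_prob should equal log(probs[0, action]).
expected = math.log(probs[0, action.item()].item())
print(action.item(), log_prob.item(), expected)
```

Because the state was given a batch dimension with `unsqueeze(0)`, each stored log probability is a one-element tensor rather than a scalar, which is what later allows the per-step losses to be joined with `torch.cat`.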

Updating the Policy

Once trajectories are collected, we can proceed with the policy update:

R = 0
policy_loss = []
returns = []

# Calculate returns from the trajectory rewards
for r in rewards[::-1]:
    R = r + 0.99 * R  # 0.99 is the discount factor
    returns.insert(0, R)
returns = torch.tensor(returns)
returns = (returns - returns.mean()) / (returns.std() + 1e-5)

# Calculate loss
for log_prob, R in zip(log_probs, returns):
    policy_loss.append(-log_prob * R)

optimizer.zero_grad()
policy_loss = torch.cat(policy_loss).sum()
policy_loss.backward()
optimizer.step()

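This section computes the discounted return for each time step, normalizes the returns, builds the policy loss from the negative log probabilities weighted by those returns, and takes one optimizer step. That completes a single REINFORCE update; in practice, episode collection and the update are repeated over many episodes.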
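The backward pass over the rewards can be checked by hand on a tiny example. The helper below is a plain-Python sketch of the same recurrence; the name `discounted_returns` and the three-step reward list are illustrative assumptions, not part of the article's code.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return the discounted return from each time step onward."""
    returns = []
    running = 0.0
    # Walk backward so each step reuses the return of the step after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.insert(0, running)
    return returns

# Three steps with a reward of 1 each, as CartPole gives per surviving step.
example_returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.99)
print(example_returns)
```

Working backward: the last step has return 1, the middle step has 1 + 0.99 × 1 = 1.99, and the first step has 1 + 0.99 × 1.99 = 2.9701. Earlier actions receive larger returns because they are credited with everything that follows them.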

Conclusion

In this article, we implemented the REINFORCE algorithm using PyTorch: a policy network, episode collection, and a gradient update based on discounted returns. While REINFORCE is only a starting point, it lays the foundation for more advanced techniques like Actor-Critic and Proximal Policy Optimization.
