
How to Monitor Model Training in PyTorch

Last updated: December 14, 2024

Monitoring model training in PyTorch is essential for understanding how well your model is learning from data, confirming that everything is working as expected, and debugging issues that arise along the way. This article walks you through several practical ways to monitor training, from simple print statements to dedicated experiment-tracking tools.
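
All of the snippets below assume that a model, loss function, optimizer, and DataLoader have already been created. As a point of reference, here is a minimal, self-contained setup you could use to try them out; the network, the random data, and the hyperparameter values are placeholders chosen purely for illustration:

import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data: 1000 samples with 20 features, 2 classes (illustrative only)
X = torch.randn(1000, 20)
y = torch.randint(0, 2, (1000,))
train_loader = DataLoader(TensorDataset(X, y), batch_size=32, shuffle=True)

# A small placeholder classifier
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
num_epochs = 5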

1. Print Logs

The simplest and quickest method of monitoring model training involves printing logs. By logging key metrics like loss and accuracy during training, you can understand how well your model is performing at each epoch.

for epoch in range(num_epochs):
    for data, labels in train_loader:
        # Forward pass
        outputs = model(data)
        loss = criterion(outputs, labels)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    print(f'Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}')
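
Note that the snippet above prints the loss of the last mini-batch in each epoch, which can be noisy. If you prefer smoother numbers, one possible variant (the accumulation variables here are illustrative) is to track an epoch-level average loss and accuracy instead:

for epoch in range(num_epochs):
    total_loss, correct, total = 0.0, 0, 0
    for data, labels in train_loader:
        outputs = model(data)
        loss = criterion(outputs, labels)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Accumulate metrics across the whole epoch
        total_loss += loss.item() * data.size(0)
        correct += (outputs.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)
    print(f'Epoch [{epoch+1}/{num_epochs}], '
          f'Avg Loss: {total_loss / total:.4f}, '
          f'Accuracy: {correct / total:.2%}')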

2. Use TensorBoard

TensorBoard is a visualization toolkit for monitoring metrics such as loss and accuracy in real time as your model trains. PyTorch supports TensorBoard directly through the torch.utils.tensorboard package.

from torch.utils.tensorboard import SummaryWriter

writer = SummaryWriter()

for epoch in range(num_epochs):
    total_loss = 0
    for data, labels in train_loader:
        outputs = model(data)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
    writer.add_scalar('Training Loss', total_loss / len(train_loader), epoch)

writer.close()

With the above code, you can run TensorBoard by executing tensorboard --logdir=runs in your command line (runs is the default log directory created by SummaryWriter) and monitor the training process in your browser.
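
The same SummaryWriter can log more than one scalar. As a rough sketch of what else you might record, the fragments below are meant to be placed inside the epoch loop of the previous snippet (before writer.close()); the tag names are only suggestions:

    # Inside the epoch loop of the previous snippet: grouped tags such as
    # 'Loss/train' keep related curves together in the TensorBoard UI
    writer.add_scalar('Loss/train', total_loss / len(train_loader), epoch)

    # Histograms of parameters help spot vanishing or exploding weights
    for name, param in model.named_parameters():
        writer.add_histogram(name, param, epoch)

# Once, before or after training: log the computation graph using one example batch
example_batch, _ = next(iter(train_loader))
writer.add_graph(model, example_batch)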

3. Use Progress Bars with TQDM

Another way to monitor your training in PyTorch is with TQDM, which wraps any iterable in a smart progress bar so you can see how far through each epoch you are, the iteration speed, and an estimate of the time remaining.

from tqdm import tqdm

for epoch in range(num_epochs):
    loop = tqdm(train_loader, total=len(train_loader), leave=False)
    for data, labels in loop:
        outputs = model(data)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loop.set_description(f'Epoch [{epoch+1}/{num_epochs}]')
        loop.set_postfix(loss=loss.item())
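
TQDM also works at the epoch level. One possible refinement, assuming the same training setup as above, is to wrap the outer loop with trange (and, if you work in Jupyter notebooks, import from tqdm.auto instead so the appropriate widget is picked automatically):

from tqdm import trange

# Outer bar over epochs, inner bar over mini-batches
for epoch in trange(num_epochs, desc='Epochs'):
    loop = tqdm(train_loader, leave=False, desc=f'Epoch {epoch+1}')
    for data, labels in loop:
        outputs = model(data)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        loop.set_postfix(loss=f'{loss.item():.4f}')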

4. Use Weights & Biases

Weights & Biases (W&B) is a tool that allows automatic and configurable logging of hyperparameters, model metrics, gradients, and more. It provides a powerful UI to monitor training progress remotely.

import wandb

wandb.init(project='project_name')

for epoch in range(num_epochs):
    running_loss = 0.0
    for i, (inputs, labels) in enumerate(train_loader, 0):
        outputs = model(inputs)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        running_loss += loss.item()
        if i % 2000 == 1999:    # Log the average loss every 2000 mini-batches
            wandb.log({'loss': running_loss / 2000})
            running_loss = 0.0

To use Weights & Biases, ensure the package is installed by running pip install wandb, and authenticate with wandb login before starting a run.
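
W&B can also record hyperparameters and gradients with very little extra code. As a sketch building on the setup above (the project name and hyperparameter values are placeholders), you could pass a config dictionary to wandb.init and ask wandb.watch to track gradients:

import wandb

# Store hyperparameters alongside the run so they appear in the W&B UI
wandb.init(project='project_name',
           config={'lr': 0.01, 'batch_size': 32, 'epochs': num_epochs})

# Periodically log gradients and parameter values for the model
wandb.watch(model, criterion, log='all', log_freq=100)

for epoch in range(num_epochs):
    for data, labels in train_loader:
        outputs = model(data)
        loss = criterion(outputs, labels)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        wandb.log({'loss': loss.item(), 'epoch': epoch})

wandb.finish()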

5. Conclusion

Monitoring your model's performance during training is crucial for identifying and resolving issues quickly. Whether you use plain logging, TensorBoard, TQDM, or Weights & Biases, you can keep track of how your model is evolving in real time. Each of these methods has its place, and depending on your needs you may find one more suitable than the others. Start by incorporating one of them into your workflow, and you should find it improves your understanding of, and control over, the training process.
