Exploration and Exploitation (Multi-armed Bandits) - Chapter 2 - Sutton and Barto
Here I focus on the challenge of balancing exploration and exploitation in reinforcement learning (RL), which is crucial for maximizing cumulative reward over time. The multi-armed bandit problem serves as a simplified RL setting: the agent must learn which actions (or “arms”) yield the best rewards, estimating their values through action-value methods.
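The core of action-value methods is the incremental sample-average update, Q_{n+1} = Q_n + (1/n)(R_n - Q_n), which tracks an arm's mean reward without storing its history. A minimal sketch, assuming a hypothetical single arm with Gaussian rewards around a true mean of 1.0:

```python
import random

# Incremental sample-average estimate of one arm's action value.
# The arm's reward distribution (Gaussian, mean 1.0) is an illustrative
# assumption, not taken from the book's testbed.
random.seed(0)

q_estimate, n = 0.0, 0
for _ in range(10_000):
    reward = random.gauss(1.0, 1.0)          # pull the arm once
    n += 1
    q_estimate += (reward - q_estimate) / n  # Q <- Q + (R - Q)/n

print(round(q_estimate, 2))  # approaches the true mean of 1.0
```

Replacing the step size 1/n with a constant alpha gives the exponential recency-weighted average used for nonstationary bandits.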

The greedy algorithm always exploits current knowledge and so risks never discovering better actions, while the epsilon-greedy algorithm mixes in random exploration to avoid getting stuck on suboptimal arms. The chapter also covers gradient bandit methods, which learn numerical action preferences directly rather than value estimates, and the Upper Confidence Bound (UCB) algorithm, which balances exploitation against uncertainty-driven exploration. Each strategy manages the exploration-exploitation trade-off differently, with distinct advantages depending on the problem.
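The two action-selection rules can be compared on a small simulated testbed. This is a minimal sketch, assuming a hypothetical 10-armed bandit with Gaussian arm means and illustrative settings epsilon = 0.1 and c = 2, not the book's exact experiments:

```python
import math
import random

random.seed(1)
true_means = [random.gauss(0.0, 1.0) for _ in range(10)]
best_arm = max(range(10), key=lambda a: true_means[a])

def run(select, steps=5000):
    """Run one bandit episode; return the fraction of pulls on the best arm."""
    q = [0.0] * 10      # incremental sample-average value estimates
    counts = [0] * 10   # how often each arm was pulled
    optimal = 0
    for t in range(1, steps + 1):
        a = select(q, counts, t)
        r = random.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (r - q[a]) / counts[a]
        optimal += (a == best_arm)
    return optimal / steps

def eps_greedy(q, counts, t, eps=0.1):
    # With probability eps explore uniformly; otherwise exploit.
    if random.random() < eps:
        return random.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])

def ucb(q, counts, t, c=2.0):
    # Untried arms first; otherwise add an uncertainty bonus to each estimate.
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))

print(f"eps-greedy optimal-action rate: {run(eps_greedy):.2f}")
print(f"UCB optimal-action rate:        {run(ucb):.2f}")
```

Epsilon-greedy keeps exploring at a constant rate forever, while UCB's bonus term shrinks for frequently tried arms, concentrating exploration on the arms it is least certain about.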