A comparison of Temporal-Difference(0) and Constant-α Monte Carlo methods on the Random Walk Task

The Monte Carlo (MC) and Temporal-Difference (TD) methods are both fundamental techniques in reinforcement learning; they solve the prediction problem from experience gathered by interacting with the environment rather than from a model of the environment. TD, however, combines ideas from MC methods and Dynamic Programming (DP), and it differs from MC in its update rule, in whether it bootstraps, and in its bias/variance trade-off (a minimal sketch of the two update rules is given at the end of this introduction). TD methods have also usually been found, empirically, to converge faster than constant-α MC methods on many tasks.

In this post, we'll compare TD and MC, or more specifically the TD(0) and constant-α MC methods, on a simple grid environment and on the more comprehensive Random Walk [2] environment. We hope this post helps readers interested in reinforcement learning better understand how each method updates the state-value function and how their performance differs in the same testing environment.

We will implement the algorithms and comparisons in Python; the libraries used in this post are as follows:

[**Learn More**](https://towardsdatascience.com/a-comparison-of-temporal-difference-0-and-constant-%CE%B1-monte-carlo-methods-on-the-random-walk-task-bc6497eb7c92)
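
To make the difference in update rules concrete, here is a minimal sketch of a TD(0) per-step update and a constant-α MC end-of-episode update. The value table `V`, the step size α, the discount γ, and the hard-coded episode are illustrative assumptions for this sketch, not code from the original post.

```python
import numpy as np

# Illustrative settings (assumptions, not the post's actual configuration).
alpha = 0.1   # constant step size shared by both methods
gamma = 1.0   # undiscounted, as in the Random Walk task

# A tabular state-value function over 5 non-terminal states, initialised to 0.5.
V_td = np.full(5, 0.5)
V_mc = np.full(5, 0.5)

# One hypothetical episode: (state, reward, next_state) transitions,
# with next_state = None marking a transition into a terminal state.
episode = [(2, 0.0, 3), (3, 0.0, 4), (4, 1.0, None)]

# TD(0): update after every step, bootstrapping from the current estimate
# of the next state's value (biased target, lower variance).
for s, r, s_next in episode:
    target = r + (gamma * V_td[s_next] if s_next is not None else 0.0)
    V_td[s] += alpha * (target - V_td[s])

# Constant-alpha MC: wait until the episode ends, then move every visited
# state toward the full return G (unbiased target, higher variance).
G = 0.0
for s, r, _ in reversed(episode):
    G = r + gamma * G
    V_mc[s] += alpha * (G - V_mc[s])

print("TD(0) values:", V_td)
print("MC values:   ", V_mc)
```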