On-Policy vs. Off-Policy Monte Carlo, With Visualizations

In reinforcement learning, we use either Monte Carlo (MC) estimates or Temporal Difference (TD) learning to establish the 'target' return from sample episodes. Both approaches let us learn from an environment whose transition dynamics are unknown, i.e., p(s',r|s,a) is unknown.

MC uses the full return observed from a state-action pair until the terminal state is reached. It has high variance but is unbiased when the sampled episodes are independent and identically distributed.

I will save comparisons between MC and TD for another day, to be supported by code. For today, the focus is on MC itself. I will discuss the difference between on-policy and off-policy MC, substantiated with concrete results from plug-and-play code that you can try with different inputs.

Read the full article: https://pub.towardsai.net/on-policy-vs-off-policy-monte-carlo-with-visualizations-5a9bc40036a2
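The linked article contains the plug-and-play code and visualizations. As a rough, self-contained sketch of the two estimators it contrasts (not the article's own code), here is first-visit on-policy MC value estimation alongside an off-policy variant using ordinary importance sampling; the episode format, discount factor, and toy policies below are my own illustrative assumptions.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (assumed for this sketch)

def episode_returns(episode, gamma=GAMMA):
    """Discounted return G_t at every step of one episode.

    `episode` is a list of (state, action, reward) tuples, terminal step last.
    """
    returns = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        _, _, r = episode[t]
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

def on_policy_first_visit_mc(episodes, gamma=GAMMA):
    """On-policy first-visit MC: average the full return seen the first time
    each state appears, over episodes generated by the policy being evaluated."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        gs = episode_returns(episode, gamma)
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(s)
            totals[s] += gs[t]
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

def off_policy_mc(episodes, target_prob, behaviour_prob, gamma=GAMMA):
    """Off-policy MC with ordinary importance sampling: episodes come from a
    behaviour policy b, and each return is reweighted by the accumulated
    likelihood ratio pi(a|s) / b(a|s) from step t to the end of the episode."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        gs = episode_returns(episode, gamma)
        rho, rhos = 1.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, a, _ = episode[t]
            rho *= target_prob(s, a) / behaviour_prob(s, a)
            rhos[t] = rho
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            totals[s] += rhos[t] * gs[t]
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

if __name__ == "__main__":
    # Two toy episodes from a tiny chain environment, just to run the functions.
    episodes = [
        [("s0", "right", 0.0), ("s1", "right", 0.0), ("s2", "right", 1.0)],
        [("s0", "right", 0.0), ("s1", "left", -1.0)],
    ]
    print(on_policy_first_visit_mc(episodes))
    b = lambda s, a: 0.5                              # uniform behaviour policy (assumed)
    pi = lambda s, a: 1.0 if a == "right" else 0.0    # deterministic target policy (assumed)
    print(off_policy_mc(episodes, pi, b))
```

The on-policy estimator evaluates the same policy that generated the data, while the off-policy one corrects for the mismatch between behaviour and target policies via the importance-sampling ratio, which is the source of its extra variance.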
Tags: Monte Carlo