On-Policy vs. Off-Policy Monte Carlo, With Visualizations

In reinforcement learning, we use either Monte Carlo (MC) estimates or Temporal Difference (TD) learning to establish the 'target' return from sample episodes. Both approaches let us learn from an environment whose transition dynamics are unknown, i.e., p(s',r|s,a) is unknown.

MC uses the full return observed from a state-action pair until the terminal state is reached. It has high variance but is unbiased when the sampled episodes are independent and identically distributed.

I will save comparisons between MC and TD for another day, to be supported by code. For today, the focus is on MC itself. I will discuss the difference between on-policy and off-policy MC, substantiated with concrete results from plug-and-play code that you can try with different inputs.

Read the full article: https://pub.towardsai.net/on-policy-vs-off-policy-monte-carlo-with-visualizations-5a9bc40036a2
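The linked article contains the plug-and-play code and visualizations. As a rough, self-contained sketch of the two estimators it contrasts (not the article's own code), here is first-visit on-policy MC value estimation alongside an off-policy variant using ordinary importance sampling; the episode format, discount factor, and toy policies below are my own illustrative assumptions.

```python
from collections import defaultdict

GAMMA = 0.9  # discount factor (assumed for this sketch)

def episode_returns(episode, gamma=GAMMA):
    """Discounted return G_t at every step of one episode.

    `episode` is a list of (state, action, reward) tuples, terminal step last.
    """
    returns = [0.0] * len(episode)
    g = 0.0
    for t in reversed(range(len(episode))):
        _, _, r = episode[t]
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
        returns[t] = g
    return returns

def on_policy_first_visit_mc(episodes, gamma=GAMMA):
    """On-policy first-visit MC: average the full return seen the first time
    each state appears, over episodes generated by the policy being evaluated."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        gs = episode_returns(episode, gamma)
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue  # first-visit: only the first occurrence counts
            seen.add(s)
            totals[s] += gs[t]
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

def off_policy_mc(episodes, target_prob, behaviour_prob, gamma=GAMMA):
    """Off-policy MC with ordinary importance sampling: episodes come from a
    behaviour policy b, and each return is reweighted by the accumulated
    likelihood ratio pi(a|s) / b(a|s) from step t to the end of the episode."""
    totals, counts = defaultdict(float), defaultdict(int)
    for episode in episodes:
        gs = episode_returns(episode, gamma)
        rho, rhos = 1.0, [0.0] * len(episode)
        for t in reversed(range(len(episode))):
            s, a, _ = episode[t]
            rho *= target_prob(s, a) / behaviour_prob(s, a)
            rhos[t] = rho
        seen = set()
        for t, (s, _, _) in enumerate(episode):
            if s in seen:
                continue
            seen.add(s)
            totals[s] += rhos[t] * gs[t]
            counts[s] += 1
    return {s: totals[s] / counts[s] for s in totals}

if __name__ == "__main__":
    # Two toy episodes from a tiny chain environment, just to run the functions.
    episodes = [
        [("s0", "right", 0.0), ("s1", "right", 0.0), ("s2", "right", 1.0)],
        [("s0", "right", 0.0), ("s1", "left", -1.0)],
    ]
    print(on_policy_first_visit_mc(episodes))
    b = lambda s, a: 0.5                              # uniform behaviour policy (assumed)
    pi = lambda s, a: 1.0 if a == "right" else 0.0    # deterministic target policy (assumed)
    print(off_policy_mc(episodes, pi, b))
```

The on-policy estimator evaluates the same policy that generated the data, while the off-policy one corrects for the mismatch between behaviour and target policies via the importance-sampling ratio, which is the source of its extra variance.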
Tags: Monte Carlo