In Reinforcement Learning, we either use Monte Carlo (MC) estimates or Temporal Difference (TD) learning to establish the ‘target’ return from sample episodes. Both approaches allow us to learn from an environment in which transition dynamics are unknown, i.e., p(s',r|s,a) ...