Monte Carlo Methods


Monte Carlo (MC) methods do not require the assumption of complete knowledge of the environment. Instead, they require only experience, i.e., sample sequences of states, actions, and rewards from actual or simulated interaction with the environment. MC methods then solve the reinforcement learning problem by averaging sample returns.

MC methods sample and average returns for each state-action pair, but these returns are inter-related: the return after taking an action in one state depends on the actions taken in later states within the same episode. This introduces the problem of non-stationarity. To handle non-stationarity, we use Generalized Policy Iteration (GPI) and learn the value function from sample returns.

Monte Carlo Prediction

The value of a state is the expected return, i.e., the expected cumulative future discounted reward, starting from that state. MC methods simply average the returns observed after visits to that state. As more returns are observed, the average should converge to the expected value.

  • Each occurrence of state \(s\) in an episode is called a visit to \(s\).
  • \(s\) may be visited multiple times in the same episode. The first time \(s\) is visited in an episode is called the first visit to \(s\).
  • The first-visit MC method estimates \(v_{\pi}(s)\) as the average of the returns following first visits to \(s\) (see the sketch after this list).
  • The every-visit MC method averages the returns following all visits to \(s\).
  • Both first-visit and every-visit MC converge to \(v_{\pi}(s)\) as the number of visits to \(s\) goes to infinity.

  • In MC methods, the estimate for one state does not build upon the estimate of any other state.

  • Also, the computational expense of estimating the value of a single state is independent of the number of states.
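
Below is a minimal sketch of first-visit MC prediction in Python. The `generate_episode(policy)` helper is an assumption for illustration, not part of any standard API; it is expected to return a list of `(state, action, reward)` tuples for one complete episode under the given policy.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, policy, num_episodes, gamma=1.0):
    """Estimate v_pi by averaging the returns that follow the first visit to each state."""
    returns_sum = defaultdict(float)   # sum of first-visit returns per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)             # current value estimates

    for _ in range(num_episodes):
        # `generate_episode(policy)` is assumed to return a list of
        # (state, action, reward) tuples for one complete episode.
        episode = generate_episode(policy)
        G = 0.0
        # Walk the episode backwards, accumulating the discounted return G_t.
        for t in range(len(episode) - 1, -1, -1):
            state, _, reward = episode[t]
            G = gamma * G + reward
            # Record G only at the first occurrence of `state` in the episode
            # (dropping this check gives the every-visit variant).
            if state not in (s for (s, _, _) in episode[:t]):
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```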

Monte Carlo Estimation of Action Values

When a model is available, state values are sufficient to determine a policy: one simply chooses whichever action leads to the best combination of reward and next state. However, when a model is not available, state values alone are not enough, and we must estimate action values (the values of state-action pairs). Therefore, the primary goal of MC methods here is to estimate \(q_{\ast}\) by considering visits to state-action pairs rather than to states.

The main complication with this approach is that many state-action pairs may never be visited. When following a deterministic policy \(\pi\), one observes returns for only one of the actions from each state, so the Monte Carlo estimates of the other actions never improve with experience. This is the general problem of maintaining exploration: to overcome it, we need to estimate the value of all actions from each state, which requires assuring continual exploration. There are two common approaches.

  • First approach – Specify that episodes start in a state-action pair, with every pair having a nonzero probability of being selected as the start. This guarantees that all state-action pairs will be visited an infinite number of times in the limit of an infinite number of episodes. This assumption is called Exploring Starts (a sketch follows this list).
  • Second approach – Consider only policies that are stochastic, with a nonzero probability of selecting every action in each state.
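
As a rough sketch of the first approach, an episode with an exploring start can be generated by drawing the initial state-action pair uniformly at random and then following the current policy for the rest of the episode. The environment interface used here (`env.reset_to`, `env.step`) is assumed for illustration and is not a standard API.

```python
import random

def episode_with_exploring_starts(env, policy, states, actions):
    """Generate one episode whose first state-action pair is drawn uniformly at
    random, so every (s, a) pair has a nonzero probability of starting an episode."""
    state = random.choice(states)      # random start state
    action = random.choice(actions)    # random first action (the exploring start)
    env.reset_to(state)                # assumed helper: force the environment into `state`

    episode = []
    done = False
    while not done:
        next_state, reward, done = env.step(action)   # assumed to return (s', r, done)
        episode.append((state, action, reward))
        state = next_state
        if not done:
            action = policy[state]     # follow the current policy after the first step
    return episode
```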

Monte Carlo Control

In Generalized Policy Iteration (GPI), both an approximate policy and an approximate value function are maintained. In the Monte Carlo version of GPI, we similarly perform alternating steps of policy evaluation and policy improvement, beginning with an arbitrary policy \(\pi_{0}\) and ending with the optimal policy and optimal action-value function:

\[\pi_{0} \overset{E}{\rightarrow} q_{\pi_{0}} \overset{I}{\rightarrow} \pi_{1} \overset{E}{\rightarrow} q_{\pi_{1}} \overset{I}{\rightarrow} \pi_{2} \overset{E}{\rightarrow} \cdots \overset{I}{\rightarrow} \pi_{\ast} \overset{E}{\rightarrow} q_{\ast} \notag\]

where \(E\) denotes policy evaluation and \(I\) denotes policy improvement.

Assuming that an infinite number of episodes are observed and that episodes are generated with Exploring Starts, the MC method will compute each \(q_{\pi_{k}}\) exactly, for arbitrary \(\pi_{k}\).

Policy improvement is done by making the policy greedy with respect to the current value function. For any action-value function \(q\), the corresponding greedy policy is the one that, for each \(s \in S\), deterministically chooses an action with maximal action value:

\[\pi(s) \doteq \arg\max_{a} q(s,a) \notag\]

Policy improvement can then be done by constructing each \(\pi_{k+1}\) as the greedy policy with respect to \(q_{\pi_{k}}\). The policy improvement theorem then applies to \(\pi_{k}\) and \(\pi_{k+1}\) because, for all \(s \in S\),

\[\begin{align} q_{\pi_{k}}(s,\pi_{k+1}(s)) &= q_{\pi_{k}}\big(s, \arg\max_{a} q_{\pi_{k}}(s,a)\big) \notag\\ &= \max_{a} q_{\pi_{k}}(s,a) \notag\\ &\ge q_{\pi_{k}}(s, \pi_{k}(s)) \notag\\ &\ge v_{\pi_{k}}(s) \notag \end{align}\]
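
The following is a minimal sketch of Monte Carlo control with exploring starts, interleaving policy evaluation (averaging first-visit returns of state-action pairs) with greedy policy improvement after every episode. The `generate_episode_es(policy)` helper is an assumption; it should produce one episode that begins with an exploring start, as in the earlier sketch.

```python
from collections import defaultdict

def mc_control_exploring_starts(generate_episode_es, states, actions,
                                num_episodes, gamma=1.0):
    """Monte Carlo ES (Exploring Starts): estimate q_pi from sampled returns and
    make the policy greedy with respect to the current estimates after each episode."""
    Q = defaultdict(float)                      # Q[(s, a)] action-value estimates
    returns_sum = defaultdict(float)
    returns_count = defaultdict(int)
    policy = {s: actions[0] for s in states}    # arbitrary initial deterministic policy

    for _ in range(num_episodes):
        episode = generate_episode_es(policy)   # assumed to begin with an exploring start
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, action, reward = episode[t]
            G = gamma * G + reward
            # First-visit check on the state-action pair.
            if (state, action) not in ((s, a) for (s, a, _) in episode[:t]):
                returns_sum[(state, action)] += G
                returns_count[(state, action)] += 1
                Q[(state, action)] = returns_sum[(state, action)] / returns_count[(state, action)]
                # Policy improvement: make the policy greedy in this state.
                policy[state] = max(actions, key=lambda a: Q[(state, a)])
    return policy, Q
```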