A simulation method is developed for computing average reward optimal policies for a finite state and action Markovian decision process. The method is shown to be consistent; i.e., it produces solutions arbitrarily close to optimal. Various types of estimation errors and confidence bounds are examined. Finally, it is shown that the probability distribution of the number of simulation cycles required to compute an ε-optimal policy satisfies a large deviations property. © 1995, Cambridge University Press. All rights reserved.
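The abstract's notion of "simulation cycles" suggests regenerative simulation: splitting a trajectory into cycles that start and end at a fixed regeneration state, and estimating the long-run average reward as the ratio of mean cycle reward to mean cycle length. A minimal sketch of that estimator for a fixed stationary policy (the function and parameter names are illustrative, not the paper's method):

```python
import random

def simulate_average_reward(P, r, policy, regen_state=0, n_cycles=2000, seed=0):
    """Estimate the long-run average reward of a fixed stationary policy
    by regenerative simulation: run the chain under `policy`, cut the
    trajectory into cycles beginning and ending at `regen_state`, and
    return (total reward over all cycles) / (total cycle length).

    P[s][a] is the transition-probability row for taking control a in
    state s; r[s][a] is the one-step reward; policy[s] is the control
    chosen in state s. All of these are illustrative assumptions."""
    rng = random.Random(seed)
    total_reward, total_length = 0.0, 0
    s = regen_state
    for _ in range(n_cycles):
        cycle_reward, cycle_len = 0.0, 0
        while True:
            a = policy[s]
            cycle_reward += r[s][a]
            cycle_len += 1
            # Sample the next state from the transition row P[s][a].
            s = rng.choices(range(len(P)), weights=P[s][a])[0]
            if s == regen_state:
                break  # cycle complete
        total_reward += cycle_reward
        total_length += cycle_len
    return total_reward / total_length
```

For a deterministic two-state chain that alternates 0 → 1 → 0 with rewards 1 and 0, every cycle has reward 1 and length 2, so the estimate is exactly 0.5, matching the known average reward.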
We consider the problem of sequential control for a finite state and action Markovian Decision Process with incomplete information regarding the transition probabilities P ∈ 𝒫. Under suitable irreducibility assumptions on 𝒫, we construct adaptive policies that maximize the rate of convergence of realized rewards to that of the optimal (nonadaptive) policy under complete information. These adaptive policies are specified via an easily computable index function of states, controls, and statistics, so that in every period one takes a control with the largest index value in the current state.
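The control rule described above, take the control with the largest index in the current state, can be sketched as follows. The specific index function of the paper is defined through statistics of the unknown transition law and is not reproduced here; as a stand-in, this sketch uses a generic optimistic index (sample-mean reward plus a logarithmic exploration bonus), which illustrates the argmax-over-indices structure only:

```python
import math

def index_policy_action(state, counts, reward_sums, t):
    """Return the control with the largest index in `state`.

    counts[s][a]      -- number of times control a was taken in state s
    reward_sums[s][a] -- cumulative reward observed for that pair
    t                 -- current period (t >= 1)

    The index below (mean + sqrt(2 log t / n)) is an illustrative
    optimistic index, not the index function of the paper."""
    best_a, best_idx = None, -math.inf
    for a in range(len(counts[state])):
        n = counts[state][a]
        if n == 0:
            return a  # untried controls are taken first
        mean = reward_sums[state][a] / n
        idx = mean + math.sqrt(2.0 * math.log(t) / n)
        if idx > best_idx:
            best_a, best_idx = a, idx
    return best_a
```

With enough observations the sample means dominate the bonus, so the rule converges to the empirically best control in each state, while the bonus forces every control to be sampled at a logarithmic rate.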