Abstract:
Consider the problem of sequential sampling from m statistical populations to maximize the expected sum of outcomes in the long run. Under suitable assumptions on the unknown parameters θ ∈ Θ, it is shown that there exists a class C_R of adaptive policies with the following properties: (i) The expected n-horizon reward V_n^{π⁰}(θ) under any policy π⁰ in C_R is equal to n μ*(θ) − M(θ) log n + o(log n), as n → ∞, where μ*(θ) is the largest population mean and M(θ) is a constant. (ii) Policies in C_R are asymptotically optimal within a larger class C_UF of "uniformly fast convergent" policies, in the sense that lim_{n→∞} (n μ*(θ) − V_n^{π⁰}(θ)) / (n μ*(θ) − V_n^π(θ)) ≤ 1, for any π ∈ C_UF and any θ ∈ Θ such that M(θ) > 0. Policies in C_R are specified via easily computable indices, defined as unique solutions to dual problems that arise naturally from the functional form of M(θ). In addition, the assumptions are verified for populations specified by nonparametric discrete univariate distributions with finite support. In the case of normal populations with unknown means and variances, the verification of one assumption is left as an open problem. © 1996 Academic Press, Inc.
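The abstract describes index-based adaptive policies whose regret grows as M(θ) log n. As a rough illustration only (not the paper's construction, whose indices solve dual problems tied to M(θ)), the sketch below implements a KL-divergence-based index rule for Bernoulli populations: each population's index is the largest mean still statistically consistent with its empirical mean given an exploration budget of log n per observation, and the population with the largest index is sampled. The Bernoulli setting, the function names, and the specific index formula are assumptions made for this example.

```python
# Illustrative sketch of an index-based adaptive sampling policy in the
# spirit of the abstract; the Bernoulli model and the KL-based index are
# assumptions for this example, not the paper's exact definitions.
import math
import random


def kl_bernoulli(p, q):
    """Kullback-Leibler divergence between Bernoulli(p) and Bernoulli(q)."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def index(p_hat, pulls, n):
    """Largest mean q whose KL distance from the empirical mean p_hat is
    within the exploration budget log(n) / pulls (found by bisection)."""
    if pulls == 0:
        return 1.0
    budget = math.log(max(n, 2)) / pulls
    lo, hi = p_hat, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= budget:
            lo = mid
        else:
            hi = mid
    return lo


def run(true_means, horizon, seed=0):
    """Sequentially sample m Bernoulli populations with the index rule."""
    rng = random.Random(seed)
    m = len(true_means)
    pulls = [0] * m
    sums = [0.0] * m
    total = 0.0
    for n in range(1, horizon + 1):
        if n <= m:
            i = n - 1  # sample each population once before using indices
        else:
            means = [sums[j] / pulls[j] for j in range(m)]
            i = max(range(m), key=lambda j: index(means[j], pulls[j], n))
        reward = 1.0 if rng.random() < true_means[i] else 0.0
        pulls[i] += 1
        sums[i] += reward
        total += reward
    return total, pulls


if __name__ == "__main__":
    total, pulls = run([0.3, 0.5, 0.7], horizon=5000)
    print("total reward:", total, "pulls per population:", pulls)
```

Under such a rule, inferior populations are sampled only on the order of log n times, which is what produces the n μ*(θ) − O(log n) reward growth stated in the abstract.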
Notes:
cited By 35