Tuesday, July 6, 2004
10:00 - 11:00 AM
The value function of a Markov decision problem assigns to each policy, its expected discounted reward. This expected reward can be estimated as the empirical average of the reward over many independent simulation runs. We derive bounds on the number of runs needed for the convergence of the empirical average to the expected reward uniformly for a class of policies, in terms of the V-C or pseudo dimension of the policy class. Uniform convergence results are also obtained for the average reward case. They can be extended to partially observed MDPs and Markov games. The results can be viewed as an extension of the probably approximately correct (PAC) learning theory for partially observable MDPs (POMDPs) and Markov games (POMGames). This is joint work with Professor Pravin Varaiya.
Rahul Jain is a PhD candidate in EECS at the University of California, Berkeley. He graduated from IIT Kanpur with B.Tech in EE in 1997, MS in ECE from Rice University in 1999 and also obtained an MA in Statistics from UC Berkeley in 2002. His research interests are in learning theory, applied probability, network economics and games and wireless networks. He will graduate with PhD in August 2004.