Learning Time-Dependent Feedback Policies with Model-Based Policy Search


Stochastic Optimal Control (SOC) is typically used to plan a movement for a specific situation. However, most SOC methods cannot generalize the movement plan to a new situation, and hence replanning is required. In this paper we present a SOC method that allows the controller to be reused in a new situation, as it is more robust to deviations from the initial movement plan. To improve the robustness of the controller, we employ an information-theoretic policy update that explicitly operates on trajectory distributions instead of single trajectories. This update limits the ‘distance’ between the trajectory distributions of the old and the new control policy and ensures a stable and smooth policy update. The introduced bound offers a closed-form solution for the resulting policy and extends results from recent developments in SOC. In contrast to many standard SOC algorithms, our approach can directly infer the system dynamics from data points and, hence, can also be used for model-based reinforcement learning.

MSc Thesis