Policy optimization methods enjoy wide practical use in reinforcement learning (RL) for applications ranging from robotic manipulation to game-playing, partly because they are easy to implement and allow for richly parameterized policies. Yet their theoretical properties, from optimality to statistical complexity, are still not fully understood. To help develop a theoretical basis for these methods, and to bridge the gap between RL and control-theoretic approaches, recent work has studied whether gradient-based policy optimization can succeed in designing feedback control policies.
In this talk, we start by showing the convergence and optimality of these methods for linear dynamical systems with quadratic costs, where, despite nonconvexity, convergence to the optimal policy occurs under mild assumptions. Next, we connect convex parameterizations in control theory, on the one hand, with the Polyak-Łojasiewicz property of the nonconvex cost function, on the other. This connection between the nonconvex and convex landscapes provides a unified view for extending the results to more complex control problems.
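As a rough illustration of the linear-quadratic setting mentioned above, the following minimal sketch runs exact (model-based) gradient descent on a static feedback gain for a discrete-time system and compares the resulting cost with the Riccati solution. Everything concrete here is an assumption made for illustration and is not taken from the talk: the double-integrator matrices, the control law u = -Kx, the initial gain `K0`, the backtracking step-size rule, and the helper names `dlyap`, `cost_and_grad`, `riccati_gain`, and `policy_gradient`.

```python
import numpy as np

def dlyap(M, W):
    """Solve X = M X M^T + W (valid when the spectral radius of M is < 1)."""
    n = M.shape[0]
    x = np.linalg.solve(np.eye(n * n) - np.kron(M, M), W.reshape(-1))
    return x.reshape(n, n)

def spectral_radius(M):
    return np.max(np.abs(np.linalg.eigvals(M)))

def cost_and_grad(K, A, B, Q, R, Sigma0):
    """LQR cost C(K) = trace(P_K Sigma0) and its gradient for u = -K x."""
    Acl = A - B @ K
    P = dlyap(Acl.T, Q + K.T @ R @ K)   # cost-to-go matrix of the policy K
    Sigma = dlyap(Acl, Sigma0)          # accumulated state covariance
    grad = 2.0 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return np.trace(P @ Sigma0), grad

def riccati_gain(A, B, Q, R, iters=5000):
    """Reference optimal gain obtained by iterating the Riccati recursion."""
    P = Q.copy()
    for _ in range(iters):
        G = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ A - A.T @ P @ B @ G
    return np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)

def policy_gradient(K, A, B, Q, R, Sigma0, steps=500, lr0=1e-2):
    """Gradient descent on K; the step is halved until the closed loop
    stays stable and the cost strictly decreases."""
    cost, grad = cost_and_grad(K, A, B, Q, R, Sigma0)
    history = [cost]
    for _ in range(steps):
        lr = lr0
        while lr > 1e-12:
            K_new = K - lr * grad
            if spectral_radius(A - B @ K_new) < 1.0:
                new_cost, new_grad = cost_and_grad(K_new, A, B, Q, R, Sigma0)
                if new_cost < cost:
                    break
            lr *= 0.5
        else:
            break  # no improving step found; stop early
        K, cost, grad = K_new, new_cost, new_grad
        history.append(cost)
    return K, history

# Hypothetical example: a discretized double integrator.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q, R, Sigma0 = np.eye(2), np.eye(1), np.eye(2)
K0 = np.array([[1.0, 2.0]])  # stabilizing: closed-loop eigenvalues at 0.9

K_pg, history = policy_gradient(K0, A, B, Q, R, Sigma0)
K_star = riccati_gain(A, B, Q, R)
cost_star, _ = cost_and_grad(K_star, A, B, Q, R, Sigma0)
print("initial cost:", history[0])
print("final cost:  ", history[-1])
print("optimal cost:", cost_star)
```

The cost is nonconvex in `K`, so the comparison with `cost_star` is the interesting part: the gap between the last entry of `history` and the Riccati cost indicates how close plain gradient descent gets on this toy instance. The backtracking rule is only a convenience to keep the iterates stabilizing; it is not meant to reproduce the step-size conditions used in the analysis.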