Many complex autonomous systems (e.g., electrical distribution networks) repeatedly select actions with the aim of achieving a given objective. Reinforcement learning (RL) offers a powerful framework for acquiring adaptive behaviour in this setting: it associates a scalar reward with each action and learns from experience which actions to select to maximise long-term reward. Although RL has recently produced impressive results (e.g., human-level play in Atari games and victory over the human world champion in the board game Go), most existing solutions work only under strong assumptions: the environment model is stationary, the objective is fixed, and trials end once the objective is met.
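To make this learning loop concrete, the sketch below shows classic tabular Q-learning with an epsilon-greedy rule. It is a minimal illustration of the standard RL setting described above, not one of the project's algorithms; the Gym-style `env.reset`/`env.step` interface is an assumption for the example.

```python
import numpy as np

# Tabular Q-learning with an epsilon-greedy rule. `env` is assumed to expose a
# Gym-style interface: `env.reset()` returns a start state, `env.step(a)`
# returns (next_state, reward, done). These names are illustrative assumptions.
def q_learning(env, n_states, n_actions, episodes=500,
               alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = np.zeros((n_states, n_actions))          # action-value estimates
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            # Explore with probability epsilon, otherwise exploit the estimates.
            if np.random.rand() < epsilon:
                a = np.random.randint(n_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)
            # Move Q[s, a] toward reward plus the discounted best future value.
            target = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (target - Q[s, a])
            s = s_next
    return Q
```

Note that the loop embodies exactly the strong assumptions listed above: a fixed reward function, stationary dynamics, and episodes that terminate once the objective is met.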

The aim of this project is to advance the state of the art in fundamental research on lifelong RL by developing several novel RL algorithms that relax the above assumptions. The new algorithms should be robust to environmental changes, both in the observations the system can make and in the actions it can perform. Moreover, they should be able to operate over long periods of time while achieving different objectives. The proposed algorithms will address three key problems in lifelong RL: planning, exploration, and task decomposition.

  • Planning is the problem of computing an action selection strategy given a (possibly partial) model of the task at hand.
  • Exploration is the problem of selecting actions with the aim of mapping out the environment rather than achieving a particular objective.
  • Task decomposition is the problem of defining different objectives and assigning a separate action selection strategy to each. (A toy sketch illustrating all three problems follows this list.)
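To fix intuitions, the following self-contained sketch illustrates all three problems on a toy tabular MDP: value iteration as a planner over a known model, a count-based optimism bonus as an exploration signal, and one policy per subgoal as a crude form of task decomposition. The toy dynamics, the bonus form, and the subgoal states are all illustrative assumptions, not the project's proposed algorithms.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP with 5 states and 2 actions; P[s, a, s'] is a transition probability
# and R[s, a] a reward. In the project these would come from the network
# simulators; here they are random placeholders.
S, A = 5, 2
P = rng.dirichlet(np.ones(S), size=(S, A))       # valid transition model
R = rng.uniform(size=(S, A))

def value_iteration(P, R, gamma=0.95, tol=1e-6):
    """Planning: derive a greedy policy from a (possibly learned) model."""
    V = np.zeros(P.shape[0])
    while True:
        Q = R + gamma * P @ V        # one-step lookahead through the model
        V_new = Q.max(axis=1)
        if np.abs(V_new - V).max() < tol:
            return Q.argmax(axis=1), Q
        V = V_new

def exploration_bonus(counts, beta=0.5):
    """Exploration: an optimism bonus that decays with visit counts, steering
    the agent toward poorly mapped state-action pairs instead of reward."""
    return beta / np.sqrt(counts + 1.0)

# Task decomposition: one reward function, and hence one policy, per subgoal.
subgoals = [2, 4]                                # illustrative subgoal states
policies = {}
for g in subgoals:
    R_g = np.zeros_like(R)
    R_g[g, :] = 1.0                              # reward only at subgoal g
    policies[g], _ = value_iteration(P, R_g)

pi, _ = value_iteration(P, R)
print("greedy plan:", pi, "| subgoal policies:", policies)
```

In a lifelong setting, the interesting questions are precisely what this sketch sidesteps: how to plan when the model P is partial and drifting, how to target exploration at what has changed, and how to discover the subgoal set rather than fixing it by hand.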

The algorithms will be evaluated in two realistic scenarios: active network management for electrical distribution networks, and microgrid management. A test protocol will be developed to evaluate each individual algorithm, as well as their combinations.

RL allows autonomous systems to improve themselves without human intervention, a central subject of the call. The novel algorithms aim to reuse knowledge acquired before each environmental change, avoiding relearning from scratch as much as possible. Many objectives will be unknown initially, and the system will have to learn to achieve new objectives as they appear. Active exploration of the environment is of fundamental concern to the project, and the algorithms for task decomposition will automatically identify new subgoals to achieve. The project aims to improve on the state of the art in the two scenarios, as well as to increase their flexibility by allowing the distribution networks to change over time. The test protocol will require metrics different from those typically used to evaluate RL algorithms; a sketch of one candidate metric follows.
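As an illustration of what such a metric might look like (our assumption; the proposal does not fix one), the sketch below reports mean reward per phase, where phases are delimited by environment changes, so that recovery speed after a change is visible rather than a single end-of-run score.

```python
import numpy as np

# A hypothetical lifelong-RL metric (illustrative assumption, not fixed by the
# proposal): mean reward per phase, with phases delimited by the time steps at
# which the environment changed.
def per_phase_mean_reward(rewards, change_points):
    phases = np.split(np.asarray(rewards, dtype=float), change_points)
    return [float(p.mean()) for p in phases]

# Example: reward drops at step 100 when the network topology changes, then
# recovers as the agent reuses prior knowledge instead of relearning.
trace = np.concatenate([np.ones(100), np.linspace(0.2, 1.0, 100)])
print(per_phase_mean_reward(trace, change_points=[100]))  # [1.0, 0.6]
```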