Learning a hierarchy
October 26, 2017
More resources
Read paper
View code
Reinforcement learning, Meta-learning, Publication, Release
Humans solve complicated challenges by breaking them up into small, manageable components. Making pancakes consists of a series of high-level actions, such as measuring flour, whisking eggs, transferring the mixture to the pan, turning the stove on, and so on. Humans are able to learn new tasks rapidly by sequencing together these learned components, even though the task might take millions of low-level actions, i.e., individual muscle contractions.
On the other hand, today’s reinforcement learning methods operate through brute force search over low-level actions, requiring an enormous number of attempts to solve a new task. These methods become very inefficient at solving tasks that take a large number of timesteps.
Our solution is based on the idea of hierarchical reinforcement learning, where agents represent complicated behaviors as a short sequence of high-level actions. This lets our agents solve much harder tasks: while the solution might require 2000 low-level actions, the hierarchical policy turns this into a sequence of 10 high-level actions, and it’s much more efficient to search over the 10-step sequence than the 2000-step sequence.
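A quick back-of-the-envelope calculation makes the gain concrete. With A primitive actions and K sub-policies (illustrative numbers, not from the paper), the naive sequence spaces have sizes A^2000 versus K^10:

```python
import math

def log10_search_space(n_choices, seq_len):
    """Base-10 log of the number of distinct action sequences."""
    return seq_len * math.log10(n_choices)

low = log10_search_space(n_choices=8, seq_len=2000)   # 8 primitive actions, 2000 steps
high = log10_search_space(n_choices=4, seq_len=10)    # 4 sub-policies, 10 high-level steps
print(f"low-level: 10^{low:.0f} sequences, high-level: 10^{high:.0f} sequences")
```

The low-level space is astronomically larger, which is why brute-force search over primitive actions fails on long-horizon tasks.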
Meta-learning shared hierarchies
Our algorithm, meta-learning shared hierarchies (MLSH), learns a hierarchical policy where a master policy switches between a set of sub-policies. The master selects an action every N timesteps, where we might take N=200. A sub-policy executed for N timesteps constitutes a high-level action, and for our navigation tasks, sub-policies correspond to walking or crawling in different directions.
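The control flow above can be sketched as a simple rollout loop (illustrative names, not the released MLSH API): the master picks a sub-policy index every N steps, and the chosen sub-policy emits the low-level actions in between.

```python
def hierarchical_rollout(env, master_policy, sub_policies, horizon, N=200):
    """Run one episode with a two-level policy.

    master_policy(obs) -> index of a sub-policy; called every N timesteps.
    sub_policies[k](obs) -> low-level action; called every timestep.
    """
    obs = env.reset()
    total_reward = 0.0
    k = master_policy(obs)
    for t in range(horizon):
        if t % N == 0:
            k = master_policy(obs)          # high-level action: pick a sub-policy
        action = sub_policies[k](obs)       # low-level action from the active sub-policy
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

The master therefore makes only horizon/N decisions per episode, which is what turns a 2000-step search problem into a 10-step one.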
In most prior work, hierarchical policies have been explicitly hand-engineered. Instead, we aim to discover this hierarchical structure automatically through interaction with the environment. Taking a meta-learning perspective, we define a good hierarchy as one that quickly reaches high reward when training on unseen tasks. Hence, the MLSH algorithm aims to learn sub-policies that enable fast learning on previously unseen tasks.
We train on a distribution over tasks, sharing the sub-policies while learning a new master policy on each sampled task. By repeatedly training new master policies, this process automatically finds sub-policies that accommodate the master policy’s learning dynamics.
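The training procedure above can be sketched as the following loop (an illustrative pseudostructure, not the released implementation; the warmup/joint split and all function names are assumptions). Sub-policy parameters persist across tasks, while the master policy is re-initialized for each sampled task:

```python
import random

def mlsh_train(tasks, init_master, sub_policies, n_iters,
               warmup_steps, joint_steps, train_master, train_joint):
    """Meta-train shared sub-policies over a distribution of tasks."""
    for _ in range(n_iters):
        task = random.choice(tasks)         # sample a task from the distribution
        master = init_master()              # fresh master policy for this task
        for _ in range(warmup_steps):       # warmup: adapt only the master
            train_master(master, sub_policies, task)
        for _ in range(joint_steps):        # then update master and sub-policies together
            train_joint(master, sub_policies, task)
    return sub_policies                     # shared parameters, reused on the next task
```

Because the master is cheap to retrain, sub-policies are selected for how quickly a fresh master can learn to sequence them, which is exactly the meta-learning objective described above.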
Experiments
In our AntMaze environment, a MuJoCo Ant robot is placed into a distribution of 9 different mazes and must navigate from the starting position to the goal. Our algorithm successfully finds a diverse set of sub-policies that can be sequenced together to solve the maze tasks, solely through interaction with the environment. This set of sub-policies can then be used to master a larger task than the ones they were trained on (see the video at the beginning of the post).
Code
We’re releasing the code for training MLSH agents, as well as the MuJoCo environments we built to evaluate these algorithms.
Authors
Kevin Frans
Jonathan Ho
Peter Chen
Pieter Abbeel