Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)
Overview
PPO-DNA is a more sample-efficient variant of PPO that uses separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.
Original paper: DNA: Proximal Policy Optimization with a Dual Network Architecture (Aitchison and Sweetser, 2022)
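To make the idea concrete, below is a minimal PyTorch sketch of the dual-network setup (this is not CleanRL's actual code; the network sizes and learning rates are placeholder values for illustration): the policy and value functions live in two separate networks, each trained by its own optimizer, so their hyperparameters can be tuned independently.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # hypothetical dimensions, for illustration only

# Separate networks: the actor outputs action logits, the critic a scalar value.
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers mean the critic can use, e.g., a different learning rate
# (or different update schedule) than the actor without sharing parameters.
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
```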
Implemented Variants
| Variants Implemented | Description |
|---|---|
| ppo_dna_atari_envpool.py, docs | Uses the blazing fast Envpool Atari vectorized environment. |
Below are our single-file implementations of PPO-DNA:
ppo_dna_atari_envpool.py
The ppo_dna_atari_envpool.py has the following features:
- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
- Works with Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
Warning
Note that ppo_dna_atari_envpool.py does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
Usage
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
Explanation of the logged metrics
See the related docs for ppo.py.
Implementation details
ppo_dna_atari_envpool.py uses a customized RecordEpisodeStatistics wrapper to work with envpool, but otherwise has the same implementation details as ppo_atari.py (see related docs).
Note that the original DNA implementation uses the StickyAction environment pre-processing wrapper (see (Machado et al., 2018)¹), but we did not implement it in ppo_dna_atari_envpool.py because envpool does not currently support StickyAction.
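As a rough illustration, here is a simplified sketch of what such a vectorized episode-statistics wrapper can look like (assuming envpool's old-style gym API where step returns batched (obs, rewards, dones, infos) arrays; this is not the exact wrapper used in ppo_dna_atari_envpool.py):

```python
import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    """Track per-sub-environment episodic returns and lengths for a vectorized env."""

    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)

    def reset(self, **kwargs):
        observations = super().reset(**kwargs)
        self.episode_returns[:] = 0
        self.episode_lengths[:] = 0
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super().step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        # Expose the running statistics, then zero the counters of finished episodes.
        infos["r"] = self.episode_returns.copy()
        infos["l"] = self.episode_lengths.copy()
        self.episode_returns *= 1 - dones
        self.episode_lengths *= 1 - dones
        return observations, rewards, dones, infos
```

The running returns and lengths are reported at every step and reset for the sub-environments whose episodes just finished, so episodic returns can be logged whenever a done flag is set.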
Experiment results
Below are the average episodic returns for ppo_dna_atari_envpool.py compared to ppo_atari_envpool.py.
| Environment | ppo_dna_atari_envpool.py | ppo_atari_envpool.py |
|---|---|---|
| BattleZone-v5 (40M steps) | 94800 ± 18300 | 28800 ± 6800 |
| BeamRider-v5 (10M steps) | 5470 ± 850 | 1990 ± 560 |
| Breakout-v5 (10M steps) | 321 ± 63 | 352 ± 52 |
| DoubleDunk-v5 (40M steps) | -4.9 ± 0.3 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 8500 ± 2600 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 184000 ± 58000 | 10200 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.1 | 16.6 ± 2.3 |
| Qbert-v5 (45M steps) | 12600 ± 4600 | 10800 ± 3300 |
| Tennis-v5 (10M steps) | 13.0 ± 2.3 | -12.4 ± 2.9 |
Learning curves:
(Learning-curve figures comparing ppo_dna_atari_envpool.py and ppo_atari_envpool.py on the environments listed above.)
Tracked experiments:
1. Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.