Proximal Policy Gradient with Dual Network Architecture (PPO-DNA)
Overview
PPO-DNA is a more sample-efficient variant of PPO that uses separate optimizers and hyperparameters for the actor (policy) and critic (value) networks.
Original paper: DNA: Proximal Policy Optimization with a Dual Network Architecture (Aitchison and Sweetser, 2022)
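To make the idea concrete, below is a minimal PyTorch sketch of the dual-network setup (this is not CleanRL's actual code; the network sizes and learning rates are placeholder values for illustration): the policy and value functions live in two separate networks, each trained by its own optimizer, so their hyperparameters can be tuned independently.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 4, 2  # hypothetical dimensions, for illustration only

# Separate networks: the actor outputs action logits, the critic a scalar value.
policy_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, act_dim))
value_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 1))

# Separate optimizers mean the critic can use, e.g., a different learning rate
# (or different update schedule) than the actor without sharing parameters.
policy_optimizer = torch.optim.Adam(policy_net.parameters(), lr=2.5e-4)
value_optimizer = torch.optim.Adam(value_net.parameters(), lr=1e-3)
```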
Implemented Variants
| Variants Implemented | Description |
|---|---|
| ppo_dna_atari_envpool.py, docs | Uses the blazing fast Envpool Atari vectorized environment. |
Below are our single-file implementations of PPO-DNA:
ppo_dna_atari_envpool.py
The ppo_dna_atari_envpool.py has the following features:
- Uses the blazing fast Envpool vectorized environment.
- For Atari games. It uses convolutional layers and common Atari-based pre-processing techniques.
- Works with Atari's pixel Box observation space of shape (210, 160, 3)
- Works with the Discrete action space
Warning
Note that ppo_dna_atari_envpool.py does not work on Windows or macOS. See envpool's built wheels here: https://pypi.org/project/envpool/#files
Usage
poetry install -E envpool
python cleanrl/ppo_dna_atari_envpool.py --help
python cleanrl/ppo_dna_atari_envpool.py --env-id Breakout-v5
Explanation of the logged metrics
See the related docs for ppo.py.
Implementation details
ppo_dna_atari_envpool.py uses a customized RecordEpisodeStatistics wrapper to work with envpool, but otherwise has the same implementation details as ppo_atari.py (see related docs).
Note that the original DNA implementation uses the StickyAction environment pre-processing wrapper (see (Machado et al., 2018)¹), but we did not implement it in ppo_dna_atari_envpool.py because envpool does not currently support StickyAction.
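As a rough illustration, here is a simplified sketch of what such a vectorized episode-statistics wrapper can look like (assuming envpool's old-style gym API where step returns batched (obs, rewards, dones, infos) arrays; this is not the exact wrapper used in ppo_dna_atari_envpool.py):

```python
import gym
import numpy as np


class RecordEpisodeStatistics(gym.Wrapper):
    """Track per-sub-environment episodic returns and lengths for a vectorized env."""

    def __init__(self, env):
        super().__init__(env)
        self.num_envs = getattr(env, "num_envs", 1)
        self.episode_returns = np.zeros(self.num_envs, dtype=np.float32)
        self.episode_lengths = np.zeros(self.num_envs, dtype=np.int32)

    def reset(self, **kwargs):
        observations = super().reset(**kwargs)
        self.episode_returns[:] = 0
        self.episode_lengths[:] = 0
        return observations

    def step(self, action):
        observations, rewards, dones, infos = super().step(action)
        self.episode_returns += rewards
        self.episode_lengths += 1
        # Expose the running statistics, then zero the counters of finished episodes.
        infos["r"] = self.episode_returns.copy()
        infos["l"] = self.episode_lengths.copy()
        self.episode_returns *= 1 - dones
        self.episode_lengths *= 1 - dones
        return observations, rewards, dones, infos
```

The running returns and lengths are reported at every step and reset for the sub-environments whose episodes just finished, so episodic returns can be logged whenever a done flag is set.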
Experiment results
Below are the average episodic returns for ppo_dna_atari_envpool.py compared to ppo_atari_envpool.py.
| Environment | ppo_dna_atari_envpool.py | ppo_atari_envpool.py |
|---|---|---|
| BattleZone-v5 (40M steps) | 94800 ± 18300 | 28800 ± 6800 |
| BeamRider-v5 (10M steps) | 5470 ± 850 | 1990 ± 560 |
| Breakout-v5 (10M steps) | 321 ± 63 | 352 ± 52 |
| DoubleDunk-v5 (40M steps) | -4.9 ± 0.3 | -2.0 ± 0.8 |
| NameThisGame-v5 (40M steps) | 8500 ± 2600 | 4400 ± 1200 |
| Phoenix-v5 (45M steps) | 184000 ± 58000 | 10200 ± 2700 |
| Pong-v5 (3M steps) | 19.5 ± 1.1 | 16.6 ± 2.3 |
| Qbert-v5 (45M steps) | 12600 ± 4600 | 10800 ± 3300 |
| Tennis-v5 (10M steps) | 13.0 ± 2.3 | -12.4 ± 2.9 |
Learning curves:
(Learning-curve figures comparing ppo_dna_atari_envpool.py and ppo_atari_envpool.py on the environments listed above.)
Tracked experiments:
1. Machado, Marlos C., Marc G. Bellemare, Erik Talvitie, Joel Veness, Matthew Hausknecht, and Michael Bowling. "Revisiting the Arcade Learning Environment: Evaluation Protocols and Open Problems for General Agents." Journal of Artificial Intelligence Research 61 (2018): 523-562.