Library for reinforcement learning in Java, version 0.3.8
The repository includes algorithms, examples, and exercises from the 2nd edition of Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto.
Our implementation is inspired by the Python code by Shangtong Zhang, but differs from the reference in two aspects:
- the algorithms are implemented separately from the problem scenarios
- the math is carried out in exact precision, which reproduces symmetries in the results whenever the problem itself features symmetries
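To illustrate the first point, here is a minimal plain-Java sketch of such a separation. The interface and method names are hypothetical and do not correspond to the library's actual API, and the sketch uses double arithmetic rather than the exact precision mentioned above.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch, not the library's actual API: the algorithm only sees a
 * generic model interface, so the same code can run on different problem scenarios. */
interface DiscreteModel<S, A> {
  List<S> states();                 // finite state set
  List<A> actions(S state);         // admissible actions in a state
  double reward(S state, A action); // expected immediate reward
  S move(S state, A action);        // transition (deterministic here for brevity)
}

class PolicyEvaluation {
  /** one sweep of iterative policy evaluation for a uniformly random policy */
  static <S, A> void sweep(DiscreteModel<S, A> model, Map<S, Double> values, double gamma) {
    for (S state : model.states()) {
      List<A> actions = model.actions(state);
      double sum = 0;
      for (A action : actions)
        sum += model.reward(state, action)
            + gamma * values.getOrDefault(model.move(state, action), 0.0);
      values.put(state, sum / actions.size());
    }
  }
}
```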
Algorithms implemented, with section and page references to the book:
- Iterative Policy Evaluation (parallel, in 4.1, p.59)
- Value Iteration to determine V*(s) (parallel, in 4.4, p.65)
- Action-Value Iteration to determine Q*(s,a) (parallel)
- First Visit Policy Evaluation (in 5.1, p.74)
- Monte Carlo Exploring Starts (in 5.3, p.79)
- Constant-alpha Monte Carlo
- Tabular Temporal Difference (in 6.1, p.96)
- Sarsa: An on-policy TD control algorithm (in 6.4, p.104)
- Q-learning: An off-policy TD control algorithm (in 6.5, p.105; see the plain-Java sketch after this list)
- Expected Sarsa (in 6.6, p.107)
- Double Sarsa, Double Expected Sarsa, Double Q-Learning (in 6.7, p.109)
- n-step Temporal Difference for estimating V(s) (in 7.1, p.115)
- n-step Sarsa, n-step Expected Sarsa, n-step Q-Learning (in 7.2, p.118)
- Random-sample one-step tabular Q-planning (parallel, in 8.1, p.131)
- Tabular Dyna-Q (in 8.2, p.133)
- Prioritized Sweeping (in 8.4, p.137)
- Semi-gradient Tabular Temporal Difference (in 9.3, p.164)
- True Online Sarsa (in 12.8, p.309)
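For reference, a minimal plain-Java sketch of the tabular Q-learning update from Section 6.5 mentioned above; the array-based class is purely illustrative and independent of the library's own implementation.

```java
import java.util.Random;

/** Illustrative sketch of tabular Q-learning with an epsilon-greedy behavior policy;
 * not the library's API. Terminal-state handling is omitted for brevity. */
class QLearningSketch {
  final double[][] q;   // action values q[state][action], initialized to 0
  final double alpha;   // step size
  final double gamma;   // discount factor
  final double epsilon; // exploration rate of the epsilon-greedy behavior policy
  final Random random = new Random();

  QLearningSketch(int states, int actions, double alpha, double gamma, double epsilon) {
    this.q = new double[states][actions];
    this.alpha = alpha;
    this.gamma = gamma;
    this.epsilon = epsilon;
  }

  /** epsilon-greedy action selection in the given state */
  int selectAction(int state) {
    if (random.nextDouble() < epsilon)
      return random.nextInt(q[state].length);
    return argmax(q[state]);
  }

  /** off-policy update towards the greedy target max_a q(s', a) */
  void update(int state, int action, double reward, int nextState) {
    double target = reward + gamma * q[nextState][argmax(q[nextState])];
    q[state][action] += alpha * (target - q[state][action]);
  }

  static int argmax(double[] values) {
    int best = 0;
    for (int i = 1; i < values.length; ++i)
      if (values[i] > values[best])
        best = i;
    return best;
  }
}
```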
Figures: results on the example problems, with panels titled Prisoner's Dilemma, Exact Gambler, Value Iteration v(s), Action Value Iteration and optimal policy, Monte Carlo Exploring Starts, Monte Carlo q(s,a), ESarsa q(s,a), QLearning q(s,a), AV-Iteration, TabularQPlan, Monte Carlo, Q-Learning, Expected Sarsa, Sarsa, 3-step Q-Learning, 3-step E-Sarsa, 3-step Sarsa, OTrue/ETrue/QTrue Online Sarsa, Prioritized Sweeping, and paths obtained using value iteration (track 1, track 2).
Exact expected reward of two adversarial optimistic agents depending on their initial configuration:
Exact expected reward of two adversarial Upper-Confidence-Bound agents depending on their initial configuration:
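For context, a short sketch of the Upper-Confidence-Bound action-selection rule that such agents follow, in plain Java with illustrative names that are not taken from the library:

```java
/** Sketch of UCB action selection in the form used in the book:
 * pick the action maximizing meanReward(a) + c * sqrt(ln(totalCount) / count(a)). */
class UcbSelection {
  /** @param meanReward empirical mean reward per action
   * @param count number of times each action was taken (assumed > 0)
   * @param totalCount total number of plays so far
   * @param c exploration constant
   * @return index of the action with the highest upper confidence bound */
  static int select(double[] meanReward, int[] count, int totalCount, double c) {
    int best = 0;
    double bestScore = Double.NEGATIVE_INFINITY;
    for (int a = 0; a < meanReward.length; ++a) {
      double score = meanReward[a] + c * Math.sqrt(Math.log(totalCount) / count[a]);
      if (score > bestScore) {
        bestScore = score;
        best = a;
      }
    }
    return best;
  }
}
```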
Specify the dependency and repository of the library in the pom.xml file of your Maven project:
<dependencies>
  <dependency>
    <groupId>ch.ethz.idsc</groupId>
    <artifactId>subare</artifactId>
    <version>0.3.8</version>
  </dependency>
</dependencies>

<repositories>
  <repository>
    <id>subare-mvn-repo</id>
    <url>https://raw.github.com/idsc-frazzoli/subare/mvn-repo/</url>
    <snapshots>
      <enabled>true</enabled>
      <updatePolicy>always</updatePolicy>
    </snapshots>
  </repository>
</repositories>
The source code is attached to every release.
Authors: Jan Hakenberg, Christian Fluri
References:
- Learning to Operate a Fleet of Cars by Christian Fluri, Claudio Ruch, Julian Zilly, Jan Hakenberg, and Emilio Frazzoli
- Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto