
Conversation

liu-jc commented Aug 24, 2017

  1. Update the description.
  2. Make the update more robust. In the loop 'for prob, next_state, reward, done in env.P[s][a]:', we should sum the expected values over all transition tuples before using the result. If env.P[s][a] ever contains more than one tuple, the current code returns a wrong result. It doesn't matter yet, because each env.P[s][a] holds a single tuple with probability 1.0, but summing makes the code correct in general (see the sketch after this list).
  3. According to David Silver's slides, for all states s, V_{k+1}(s) should be computed from V_k. So I use a new_V array to hold the updated values and only replace V after the full sweep. Maybe that is more reasonable?
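
To illustrate points 2 and 3, here is a minimal sketch of what I mean (assuming the (prob, next_state, reward, done) format of env.P[s][a] used in this repo's GridWorld environment; the value-iteration-style max over actions and the function name are just for illustration):

```python
import numpy as np

def value_iteration_sketch(env, theta=1e-4, discount_factor=1.0):
    """Synchronous value iteration sketch.

    Assumes env.P[s][a] is a list of (prob, next_state, reward, done)
    tuples, as in the GridWorld-style environments in this repo.
    """
    V = np.zeros(env.nS)
    while True:
        # Point 3: compute V_{k+1} from V_k, so write into a new array
        # instead of updating V in place.
        new_V = np.zeros(env.nS)
        for s in range(env.nS):
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                # Point 2: sum over *all* transition tuples, not just
                # the value of the last tuple seen in the loop.
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])
            new_V[s] = np.max(action_values)
        delta = np.max(np.abs(new_V - V))
        V = new_V
        if delta < theta:
            break
    return V
```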

dennybritz commented Aug 27, 2017

Hi, thank you! I need to look more closely at this in a few days.

Regarding 3., I think both work. Updating in place tends to converge faster. The slides don't cover this, but the book has a proof for it.
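
For reference, the in-place variant would look roughly like this (same assumptions about env.P as in the sketch above; this is only a sketch of the idea, not the exact code in this repo):

```python
import numpy as np

def value_iteration_in_place(env, theta=1e-4, discount_factor=1.0):
    """In-place (asynchronous) value iteration sketch.

    Each backup immediately uses the freshest values of the other
    states, which is why it often converges in fewer sweeps.
    """
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            action_values = np.zeros(env.nA)
            for a in range(env.nA):
                for prob, next_state, reward, done in env.P[s][a]:
                    action_values[a] += prob * (reward + discount_factor * V[next_state])
            best_value = np.max(action_values)
            delta = max(delta, abs(best_value - V[s]))
            V[s] = best_value  # overwrite V immediately (Gauss-Seidel style)
        if delta < theta:
            break
    return V
```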
