強化學習與最優控制

強化學習與最優控制

《強化學習與最優控制》是2020年清華大學出版社出版的圖書,作者是[美]德梅萃·P. 博賽卡斯(Dimitri P. Bertsekas)。本書討論依賴於近似的解決方法,以產生具有足夠性能的次優策略。

基本介紹

  • 中文名:強化學習與最優控制
  • 作者:[美]德梅萃·P. 博賽卡斯(Dimitri P. Bertsekas)
  • 出版社:清華大學出版社
  • 出版時間:2020年 
  • 定價:149 元
  • ISBN:9787302540328 
內容簡介,圖書目錄,作者簡介,

內容簡介

本書的目的是考慮大型且具有挑戰性的多階段決策問題,這些問題原則上可以通過動態規劃和控制來解決,但它們的精確解決方案在計算上是難以處理的。本書討論依賴於近似的解決方法,以產生具有足夠性能的次優策略。這些方法統稱為增強學習,也可以叫做近似動態規劃和神經動態規劃等。 本書的主題產生於*控制和人工智慧思想的相互作用。本書的目的之一是探索這兩個領域之間的共同邊界,並架設一座具有任一領域背景的專業人士都可以訪問的橋樑。

圖書目錄

1 Exact Dynamic Programming
1.1 DeterministicDynamicProgramming 2
1.1.1 DeterministicProblems 2
1.1.2 TheDynamicProgrammingAlgorithm 7
1.1.3 Approximation inValue Space 12
1.2 StochasticDynamicProgramming 14
1.3 Examples,Variations, and Simplifications 18
1.3.1 Deterministic ShortestPathProblems 19
1.3.2 DiscreteDeterministicOptimization 21
1.3.3 Problemswith aTermination State 25
1.3.4 Forecasts 26
1.3.5 Problems with Uncontrollable State Components 29
1.3.6 PartialState Information andBelief States 34
1.3.7 LinearQuadraticOptimalControl38
1.3.8 SystemswithUnknownParameters -Adaptive
Control 40
1.4 ReinforcementLearning andOptimalControl - Some
Terminology 43
1.5 Notes and Sources 45
2 Approximation in Value Space
2.1 ApproximationApproaches inReinforcementLearning 50
2.1.1 General Issues ofApproximation inValue Space 54
2.1.2 Off-Line andOn-LineMethods 56
2.1.3 Model-Based Simplification of the Lookahead
Minimization 57
2.1.4 Model-Free off-Line Q-Factor Approximation 58
2.1.5 Approximation inPolicy Space onTop of
ApproximationinValue Space 61
2.1.6 When is Approximation in Value Space Effective? 62
2.2 Multistep Lookahead 64
??ii
viii Contents
2.2.1 Multistep Lookahead and Rolling Horizon 65
2.2.2 Multistep Lookahead and Deterministic Problems 67
2.3 Problem Approximation 69
2.3.1 Enforced Decomposition 69
2.3.2 Probabilistic Approximation - Certainty Equivalent
Control 76
2.4 Rollout and the Policy Improvement Principle 83
2.4.1 On-Line Rollout for Deterministic Discrete
Optimization 84
2.4.2 Stochastic Rollout and Monte Carlo Tree Search 95
2.4.3 Rollout with an Expert 104
2.5 On-Line Rollout for Deterministic Infinite-Spaces Problems -
Optimization Heuristics 106
2.5.1 Model Predictive Control 108
2.5.2 Target Tubes and the Constrained Controllability
Condition 115
2.5.3 Variants of Model Predictive Control 118
2.6 Notes and Sources 120
3 Parametric Approximation
3.1 Approximation Architectures 126
3.1.1 Linear and Nonlinear Feature-Based Architectures 126
3.1.2 Training of Linear and Nonlinear Architectures 134
3.1.3 Incremental Gradient and Newton Methods135
3.2 Neural Networks 149
3.2.1 Training of Neural Networks 153
3.2.2 Multilayer and Deep Neural Networks 157
3.3 Sequential Dynamic Programming Approximation161
3.4 Q-Factor Parametric Approximation 162
3.5 Parametric Approximation in Policy Space by
Classification 165
3.6 Notes and Sources 171
4 Infinite Horizon Dynamic Programming
4.1 An Overview of Infinite Horizon Problems 174
4.2 Stochastic Shortest Path Problems 177
4.3 Discounted Problems 187
4.4 Semi-Markov Discounted Problems 192
4.5 Asynchronous Distributed Value Iteration 197
4.6 Policy Iteration 200
4.6.1 Exact Policy Iteration 200
4.6.2 Optimistic and Multistep Lookahead Policy
Iteration205
4.6.3 Policy Iteration for Q-factors 208
Contents i??
4.7 Notes and Sources 209
4.8 Appendix: MathematicalAnalysis 211
4.8.1 Proofs for Stochastic ShortestPathProblems 212
4.8.2 Proofs forDiscountedProblems217
4.8.3 ConvergenceofExact andOptimistic
Policy Iteration 218
5 Infinite Horizon Reinforcement Learning
5.1 Approximation in Value Space - Performance Bounds 222
5.1.1 LimitedLookahead224
5.1.2 Rollout and Approximate Policy Improvement 227
5.1.3 ApproximatePolicy Iteration 232
5.2 FittedValue Iteration 235
5.3 Simulation-BasedPolicy IterationwithParametric
Approximation239
5.3.1 Self-Learning andActor-CriticMethods 239
5.3.2 Model-Based Variant of a Critic-Only Method 241
5.3.3 Model-FreeVariant of aCritic-OnlyMethod 243
5.3.4 Implementation Issues ofParametricPolicy
Iteration246
5.3.5 Convergence Issues ofParametricPolicy Iteration -
Oscillations 249
5.4 Q-Learning 253
5.4.1 Optimistic Policy Iteration with Parametric Q-Factor
Approximation- SARSAandDQN 255
5.5 AdditionalMethods -TemporalDifferences 256
5.6 Exact andApproximateLinearProgramming 267
5.7 Approximation inPolicy Space 270
5.7.1 Training byCostOptimization -PolicyGradient,
Cross-Entropy,andRandomSearchMethods 276
5.7.2 Expert-BasedSupervisedLearning 286
5.7.3 ApproximatePolicy Iteration,Rollout, and
ApproximationinPolicySpace288
5.8 Notes and Sources 293
5.9 Appendix: MathematicalAnalysis 298
5.9.1 Performance Bounds for Multistep Lookahead 299
5.9.2 Performance Bounds for Rollout301
5.9.3 Performance Bounds for Approximate Policy
Iteration304
6 Aggregation
6.1 AggregationwithRepresentativeStates 308
6.1.1 Continuous State and Control Space Discretization p 314
6.1.2 Continuous State Space - POMDP Discretization 315
?? Contents
6.2 AggregationwithRepresentativeFeatures 317
6.2.1 Hard Aggregation and Error Bounds 320
6.2.2 AggregationUsingFeatures 322
6.3 Methods for Solving theAggregateProblem 328
6.3.1 Simulation-BasedPolicy Iteration 328
6.3.2 Simulation-Based Value Iteration 331
6.4 Feature-BasedAggregationwith aNeuralNetwork 332
6.5 BiasedAggregation 334
6.6 Notes and Sources 337
6.7 Appendix: MathematicalAnalysis 340
References 345
Index 369

作者簡介

Dimitri P. Bertseka,美國MIT終身教授,美國國家工程院院士,清華大學複雜與網路化系統研究中心客座教授。電氣工程與計算機科學領域國際知名作者,著有《非線性規劃》《網路最佳化》《凸最佳化》等十幾本暢銷教材和專著。

相關詞條

熱門詞條

聯絡我們