Synopsis
The victory of the intelligent agent AlphaGo over human Go experts reshaped our understanding of artificial intelligence and brought its core technology, reinforcement learning, to the broad attention of the academic community. Against this backdrop, this book presents the author's years of research on reinforcement learning theory and applications, together with recent developments in the field at home and abroad, making it one of the few specialized monographs on reinforcement learning. The book focuses on reinforcement learning methods based on direct policy search, drawing on a range of techniques from statistical learning to analyze, improve, and apply the relevant methods. It describes policy search reinforcement learning algorithms from a fresh, modern perspective. Starting from different reinforcement learning scenarios, it discusses the many difficulties that reinforcement learning faces in practical applications. For each scenario, it presents a concrete policy search algorithm, analyzes the statistical properties of the algorithm's estimators and learning parameters, and demonstrates the algorithm on application examples with quantitative comparisons. In particular, the book combines cutting-edge reinforcement learning techniques to apply policy search algorithms to robot control and digital art rendering, offering a refreshing perspective. Finally, drawing on the author's long-term research experience, it briefly reviews and summarizes the development trends of reinforcement learning. The material is classical and comprehensive, the concepts are clear, and the derivations are rigorous, with the aim of forming a complete body of knowledge that integrates fundamental theory, algorithms, and applications.
About the Author
Zhao Tingting (趙婷婷) is an associate professor at the College of Artificial Intelligence, Tianjin University of Science and Technology. Her main research interests are artificial intelligence and machine learning. She is a member of the China Computer Federation (CCF) and its YOCSEF forum, a member of the Chinese Association for Artificial Intelligence (CAAI), and a committee member of the CAAI Pattern Recognition Technical Committee. In 2017 she was named a second-tier candidate of the Tianjin "131" Innovative Talent Training Project.
Contents
Chapter 1  Overview of Reinforcement Learning ··· 1
1.1  Reinforcement Learning in Machine Learning ··· 1
1.2  Reinforcement Learning in Intelligent Control ··· 4
1.3  Branches of Reinforcement Learning ··· 8
1.4  Contributions of This Book ··· 11
1.5  Structure of This Book ··· 12
References ··· 14
Chapter 2  Related Work and Background ··· 19
2.1  Markov Decision Processes ··· 19
2.2  Value-Function-Based Policy Learning Algorithms ··· 21
2.2.1  Value Functions ··· 21
2.2.2  Policy Iteration and Value Iteration ··· 23
2.2.3  Q-learning ··· 25
2.2.4  Least-Squares Policy Iteration ··· 27
2.2.5  Value-Function-Based Deep Reinforcement Learning Methods ··· 29
2.3  Policy Search Algorithms ··· 30
2.3.1  Modeling for Policy Search Algorithms ··· 31
2.3.2  The Classical Policy Gradient Algorithm (REINFORCE) ··· 32
2.3.3  Natural Policy Gradient Methods ··· 33
2.3.4  Expectation-Maximization Policy Search Methods ··· 35
2.3.5  Policy-Based Deep Reinforcement Learning Methods ··· 37
2.4  Chapter Summary ··· 38
References ··· 39
Chapter 3  Analysis and Improvement of Policy Gradient Estimation ··· 42
3.1  Research Background ··· 42
3.2  Policy Gradients with Parameter-Based Exploration (PGPE) ··· 44
3.3  Variance Analysis of Gradient Estimates ··· 46
3.4  Algorithm Improvement and Analysis Based on the Optimal Baseline ··· 48
3.4.1  Basic Idea of the Optimal Baseline ··· 48
3.4.2  The Optimal Baseline for PGPE ··· 49
3.5  Experiments ··· 51
3.5.1  Illustrative Examples ··· 51
3.5.2  The Inverted-Pendulum Balancing Problem ··· 57
3.6  Summary and Discussion ··· 58
References ··· 60
Chapter 4  Parameter-Exploring Policy Gradient Algorithms Based on Importance Sampling ··· 63
4.1  Research Background ··· 63
4.2  PGPE in the Off-Policy Setting ··· 64
4.2.1  Importance-Weighted PGPE (IW-PGPE) ··· 65
4.2.2  Variance Reduction in IW-PGPE via Baseline Subtraction ··· 66
4.3  Experimental Results ··· 68
4.3.1  Illustrative Examples ··· 69
4.3.2  The Mountain-Car Task ··· 78
4.3.3  Simulated Robot Control Tasks ··· 81
4.4  Summary and Discussion ··· 88
References