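Synopsis
The theory component aims to have students master the analysis and modeling of complex data; the methods component aims to enable students to carry out big data research and development following the norms of empirical research and the steps of data mining; the tools component aims to make students proficient in one data analysis language. The book consists of 10 chapters: overview of big data analysis; the data mining process; supervised learning; unsupervised learning; Bayesian classification and causal learning; high-dimensional regression and variable selection; graphical models; customer relationship management; social network analysis; and natural language models and text mining.
This book can serve as a textbook or teaching reference for senior undergraduate and graduate courses on data mining, machine learning, artificial intelligence, and related subjects in statistics, management, computer science, and other majors.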
Table of Contents
Chapter 1 Overview of Big Data Analysis
1.1 Overview of big data
1.1.1 What is big data
1.1.2 Data, information, and cognition
1.1.3 Data management and databases
1.1.4 Data warehouses
1.1.5 The meaning and basic characteristics of data mining
1.2 The emergence and functions of data mining
1.2.1 History of data mining
1.2.2 Functions of data mining
1.3 Relationships between data mining and related fields
1.3.1 Data mining and machine learning
1.3.2 Data mining and data warehouses
1.3.3 Data mining and statistics
1.3.4 Data mining and intelligent decision-making
1.3.5 Data mining and cloud computing
1.4 Big data research methods
1.5 Discussion questions
1.6 Recommended reading
Chapter 2 The Data Mining Process
2.1 Overview of the data mining process
2.1.1 Problem identification
2.1.2 Data understanding
2.1.3 Data preparation
2.1.4 Model building
2.1.5 Model evaluation
2.1.6 Deployment
2.2 Outlier detection
2.2.1 Statistics-based outlier detection
2.2.2 Distance-based outlier detection
2.2.3 Local outlier algorithms
2.3 Cascade algorithms for imbalanced data
2.4 Discussion questions
2.5 Recommended reading
Chapter 3 Supervised Learning
3.1 Overview of supervised learning
3.2 k-nearest neighbors
3.3 Decision trees
3.3.1 Basic concepts of decision trees
3.3.2 Classification and regression trees
3.3.3 Pruning decision trees
3.4 Boosting
3.5 Random forests
3.5.1 Definition of the random forest algorithm
3.5.2 How to determine the node-splitting variables of trees in the random forest algorithm
3.5.3 The random forest regression algorithm
3.6 Artificial neural networks
3.6.1 Basic concepts of artificial neural networks
3.6.2 The perceptron algorithm
3.6.3 The LMS algorithm
3.6.4 The backpropagation algorithm
3.6.5 Discussion of issues related to neural networks
3.7 Support vector machines
3.7.1 Maximum-margin classification
3.7.2 Solving the support vector machine problem
3.7.3 Kernel methods for support vector machines
3.8 Multivariate adaptive regression splines
3.9 Discussion questions
3.10 Recommended reading
Chapter 4 Unsupervised Learning
4.1 Association rules
4.1.1 Static association rule algorithm: the Apriori algorithm
4.1.2 Dynamic association rule algorithm: the Carma algorithm
4.1.3 Sequential rule mining algorithms
4.2 Cluster analysis
4.2.1 The meaning and role of cluster analysis
4.2.2 Definition of distance
4.2.3 Hierarchical clustering
4.2.4 The k-means algorithm
4.2.5 The BIRCH algorithm
4.2.6 Density-based clustering algorithms
4.3 Clustering methods based on prediction strength
4.3.1 Prediction strength
4.3.2 Applications of the prediction strength method
4.3.3 Case study
4.4 Variable selection for clustering problems
4.4.1 Gaussian model-based clustering with pairwise penalties
4.4.2 Model-based clustering with pairwise penalties and cluster-specific variances
4.4.3 Comparison of several variable selection methods for clustering
4.5 Discussion questions
4.6 Recommended reading
Chapter 5 Bayesian Classification and Causal Learning
5.1 Bayesian classification
5.2 Decision theory and statistical decision theory
5.2.1 Decisions and risk
5.2.2 Statistical decisions
5.3 Linear and quadratic discriminant functions
5.4 Naive Bayes classification
5.5 Bayesian networks
5.5.1 Basic concepts
5.5.2 Applications of Bayesian networks
5.5.3 Constructing Bayesian networks
5.6 Case study: applying a Bayesian network model to credit card default probability modeling
5.7 Discussion questions
5.8 Recommended reading
Chapter 6 High-Dimensional Regression and Variable Selection
6.1 Linear regression models
6.2 Model selection
6.2.1 Overview of model selection
6.2.2 Bias-variance decomposition
6.2.3 Model selection criteria
6.2.4 Variable selection in regression
6.3 Generalized linear models
6.3.1 Two-point distribution regression
6.3.2 The exponential family of probability distributions
6.3.3 Generalized linear models
6.3.4 Model estimation
6.3.5 Model testing and diagnostics
6.4 Coefficient shrinkage in high-dimensional regression
6.4.1 Ridge regression
6.4.2 LASSO
6.4.3 The Shooting algorithm
6.4.4 Path algorithms
6.4.5 Other penalty terms and the oracle property
6.4.6 Software implementation
6.5 Summary
6.6 Discussion questions
6.7 Recommended reading
Chapter 7 Graphical Models
7.1 Basic concepts and properties of graphical models
7.1.1 Graph matrices
7.1.2 Concepts and properties of probabilistic graphical models
7.2 Covariance selection
7.2.1 Estimating graphical models by regression
7.2.2 Methods based on the maximum likelihood framework
7.3 Exponential family graphical models
7.3.1 Basic definitions
7.3.2 Parameter estimation and hypothesis testing
7.4 Spectral clustering
7.4.1 Clustering and graph partitioning
7.4.2 Spectral clustering
7.5 Summary
7.6 Discussion questions
7.7 Recommended reading
Chapter 8 Customer Relationship Management
8.1 Collaborative recommendation models
8.1.1 Neighborhood-based algorithms
8.1.2 Matrix factorization models
8.2 Stochastic models of customer value
8.2.1 Definition of customer value
8.2.2 Customer value analysis models
8.2.3 Transition matrix of customer purchase states
8.2.4 Profit matrix
8.2.5 Calculating customer value
8.3 Case study: a customer value model for bank card spending
8.4 Recommended reading
Chapter 9 Social Network Analysis
9.1 Overview of social networks
9.1.1 Concept and development of social networks
9.1.2 Basic characteristics of social networks
9.1.3 Community mining algorithms
9.1.4 Model evaluation
9.2 Case study: social networks in the study of collaboration among academic institutions
9.3 Discussion questions
9.4 Recommended reading
Appendix A R programs for this chapter
Chapter 10 Natural Language Models and Text Mining
10.1 Vector space model
10.1.1 Basic concepts of the vector space model
10.1.2 Feature selection criteria
10.2 Statistical language models
10.2.1 The n-gram model
10.2.2 The topic n-gram model
10.3 The LDA model
10.4 Case study: discovering hot news with the LDA model
10.5 Recommended reading