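Synopsis
This book is the companion lab manual to Introduction to Big Data (《大數據導論》, ISBN 9787302500704). It aims to help readers consolidate the fundamentals, reproduce real enterprise workflows, and build hands-on skills. Starting from the basic programming knowledge that big data development requires, it first covers the commands commonly used in a Linux development environment. It then introduces the basic operations of the data cleaning tool Kettle, along with common data visualizations such as pie charts, bar charts, line charts, and parallel coordinates plots. Finally, it applies popular big data techniques (data cleaning, data visualization, and data mining) to concrete cases in the environmental, financial, and e-commerce sectors, giving readers a realistic big data experience. The book provides a rich set of hands-on project cases that study real industry data under realistic conditions, with the goal of developing the project skills of practice-oriented professionals.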
Table of Contents
Part 1 Getting Started with Linux
Lab 1 Creating, Accessing, Modifying, and Deleting Files
1.1 Objectives
1.2 Requirements
1.3 Principles
1.4 Procedure
1.5 Results
Lab 2 Creating Files, Viewing Them, and Modifying Their Contents
2.1 Objectives
2.2 Requirements
2.3 Principles
2.4 Procedure
2.5 Results
Lab 3 Common Text Editing Techniques: Copy, Paste, Delete
3.1 Objectives
3.2 Requirements
3.3 Principles
3.4 Procedure
3.5 Results
Part 2 Data Cleaning
Lab 4 Extracting Data from a Text File into a Database
4.1 Objectives
4.2 Requirements
4.3 Principles
4.3.1 Introduction to Kettle
4.3.2 Method for Extracting Data from a Text File into a Database
4.4 Procedure
4.4.1 Installation
4.4.2 Steps for Extracting Data from a Text File into a Database
4.5 Results
Lab 5 Extracting Data from a CSV File into a Database
5.1 Objectives
5.2 Requirements
5.3 Principles
5.4 Procedure
5.5 Results
Lab 6 Importing Excel File Data into a Database
6.1 Objectives
6.2 Requirements
6.3 Principles
6.4 Procedure
6.5 Results
Lab 7 Migrating MySQL Data to MongoDB
7.1 Objectives
7.2 Requirements
7.3 Principles
7.4 Procedure
7.5 Results
Lab 8 Incremental Data Extraction from a Database
8.1 Objectives
8.2 Requirements
8.3 Principles
8.4 Procedure
8.5 Results
Lab 9 Incremental Updates for Inserts, Updates, and Deletes
9.1 Objectives
9.2 Requirements
9.3 Principles
9.4 Procedure
9.5 Results
Lab 10 Data Masking
10.1 Objectives
10.2 Requirements
10.3 Principles
10.4 Procedure
10.5 Results
Lab 11 Data Validation
11.1 Objectives
11.2 Requirements
11.3 Principles
11.4 Procedure
11.4.1 Setting Validation Rules
11.4.2 Non-Null Validation
11.4.3 Date Type Validation
Lab 12 Missing Value Cleaning
12.1 Objectives
12.2 Requirements
12.3 Principles
12.4 Procedure
12.4.1 Cleaning by Running SQL Scripts
12.4.2 Cleaning with Controls
Lab 13 Format and Content Cleaning
13.1 Objectives
13.2 Requirements
13.3 Principles
13.4 Procedure
13.4.1 Cleaning "Format Error Type 1"
13.4.2 Cleaning "Format Error Type 2"
Lab 14 Logical Error Cleaning
14.1 Objectives
14.2 Requirements
14.3 Principles
14.4 Procedure
14.4.1 Cleaning "Logical Error Type 1"
14.4.2 Cleaning "Logical Error Type 2"
Part 3 Data Visualization
Lab 15 Drawing Pie Charts, Bar Charts, Line Charts, and Parallel Coordinates Plots
15.1 Objectives
15.2 Requirements
15.3 Principles
15.4 Procedure
15.4.1 Importing Data and Modules
15.4.2 Data Extraction
15.4.3 Plotting
Lab 16 Visual Analysis of Bike-Sharing Data
16.1 Objectives
16.2 Requirements
16.3 Procedure
16.3.1 Data Preparation
16.3.2 Data Cleaning
16.3.3 Data Processing
16.3.4 Data Mining
16.3.5 Visual Analysis
Lab 17 Drawing a Word Cloud from a Novel
17.1 Objectives
17.2 Requirements
17.3 Principles
17.3.1 jieba Word Segmentation
17.3.2 wordcloud Word Clouds
17.4 Procedure
17.4.1 Importing Modules
17.4.2 Reading Files and Setting Paths
17.4.3 Text Segmentation
17.4.4 Drawing the Word Cloud
Lab 18 Visualizing Basketball Shooting Percentage
18.1 Objectives
18.2 Requirements
18.3 Principles
18.4 Procedure
18.4.1 Importing Modules and Data Files
18.4.2 Processing the Data
18.4.3 Visual Analysis
Part 4 Environmental Big Data in Practice
Lab 19 Forecasting Carbon Dioxide Levels
19.1 Objectives
19.2 Requirements
19.3 Principles
19.4 Procedure
19.4.1 Importing Packages and Loading Data
19.4.2 Visualizing the Initial Data
19.4.3 The ARIMA Time Series Model
19.4.4 Parameter Selection for the ARIMA Time Series Model
19.4.5 Configuring the ARIMA Time Series Model
19.4.6 Validating Forecasts
19.4.7 Generating and Visualizing Forecasts
Lab 20 Analyzing the Causes of Air Pollution in Singapore
20.1 Objectives
20.2 Requirements
20.3 Principles
20.4 Procedure
20.4.1 Data Preparation
20.4.2 Testing Hypothesis 1: Growth in Manufacturing Increases Air Pollution in Singapore
XII 大數據導論技術實訓
20.4.3 Testing Hypothesis 2: An Increase in the Number of Buildings Constructed Increases Air Pollution in Singapore
20.4.4 Testing Hypothesis 3: An Increase in the Number of Vehicles Increases Air Pollution in Singapore
Lab 21 Historical Weather Statistics for Shanghai
21.1 Objectives
21.2 Requirements
21.3 Principles
21.4 Procedure
21.4.1 Writing the Mapper Program
21.4.2 Writing the Reducer Program
21.4.3 Computing Monthly Historical Weather Statistics for Shanghai in 2016
Lab 22 Monthly Air Quality Statistics for Shanghai
22.1 Objectives
22.2 Requirements
22.3 Principles
22.4 Procedure
22.4.1 Writing the Mapper Program
22.4.2 Writing the Reducer Program
22.4.3 Computing Monthly Air Quality Statistics for Shanghai in 2016
Lab 23 Comparing Monthly Average Temperatures in Beijing and Shanghai
23.1 Objectives
23.2 Requirements
23.3 Principles
23.4 Procedure
23.4.1 Writing the Mapper Program
23.4.2 Writing the Reducer Program
23.4.3 Comparing the 2016 Monthly Average Temperatures of Beijing and Shanghai
Part 5 Financial Big Data in Practice
Lab 24 Optimal Portfolio (Part 1)
24.1 Objectives
24.2 Requirements
24.3 Principles
24.4 Procedure
24.4.1 Importing the Required Modules
24.4.2 Reading the Data
24.4.3 Inspecting Missing Values
24.4.4 Data Visualization
24.4.5 Preliminary Statistical Analysis
24.4.6 Portfolio Optimization
24.4.7 Computing the Portfolio Mean Return
24.5 Results
Lab 25 Optimal Portfolio (Part 2)
25.1 Objectives
25.2 Requirements
25.3 Principles
25.4 Procedure
25.4.1 Maximum Sharpe Ratio Portfolio
25.4.2 Minimum Variance Portfolio
25.4.3 Drawing the Scatter Plot
25.5 Results
Lab 26 Stock Trend Forecasting
26.1 Objectives
26.2 Requirements
26.3 Principles
26.4 Procedure
26.4.1 Importing Modules
26.4.2 Building the ARIMA Model
26.4.3 Differencing the Data
26.4.4 Autocorrelation and Partial Autocorrelation Plots
26.4.5 Model Training
26.5 Results
Part 6 Business Big Data in Practice
Lab 27 Sentiment Analysis of E-Commerce Product Reviews
27.1 Objectives
27.2 Requirements
27.3 Principles
27.4 Procedure
27.4.1 Extracting Review Data
27.4.2 Deduplicating Review Text
27.4.3 Model Preparation
27.4.4 Removing Prefix Ratings
27.4.5 Text Segmentation
27.4.6 Model Construction
27.5 Results
Lab 28 Analysis of eBay Car Sales Data
28.1 Objectives
28.2 Requirements
28.3 Principles
28.3.1 Data Standardization
28.3.2 Data Visualization
28.4 Procedure
28.4.1 Loading and Describing the Data
28.4.2 Data Profiling
28.4.3 Preprocessing
28.4.4 Visual Analysis
28.5 Results
Lab 29 Airline Customer Value Analysis
29.1 Objectives
29.2 Requirements
29.3 Principles
29.4 Procedure
29.4.1 Data Preparation
29.4.2 Data Processing
29.4.3 Data Preprocessing
29.4.4 Building the Model
29.5 Results
Lab 30 Market Basket Analysis
30.1 Objectives
30.2 Requirements
30.3 Principles
30.3.1 MLxtend
30.3.2 Association Rules
30.3.3 Mining Frequent Itemsets with the Apriori Algorithm
30.4 Procedure
30.4.1 Importing and Reading Data with Pandas and MLxtend Code
30.4.2 Data Processing
30.4.3 One-Hot Encoding
30.4.4 Running Association Rule Mining with the Algorithm Package
30.4.5 Viewing the Results
30.4.6 Popular Combinations in Germany
Appendix A Big Data and Artificial Intelligence Lab Environments
A.1 Big Data Lab Environment
A.2 Artificial Intelligence Lab Environment