About the Book
The key technologies of data science include data storage and computing, data governance, structured data analysis, speech analysis, visual analysis, text analysis, and knowledge graphs. This book focuses on text analysis and knowledge graph technologies in detail. The text analysis topics cover text pre-trained models, multilingual text analysis, text sentiment analysis, text machine translation, intelligent text correction, NL2SQL question answering, and the ChatGPT large language model. The knowledge graph topics cover knowledge graph construction and knowledge graph question answering. Combining theory with practice, the book lays out the implementation path for each topic and walks through the source code of solutions and techniques used in industry algorithm competitions, helping readers gain a deep understanding of the underlying principles. Finally, the book presents practical cases of intelligent applications of text analysis and knowledge graph technologies in government affairs, public security, emergency response, and other industries.
Table of Contents
Chapter 1 What Is Data Science ························ 1
1.1 Definition of Data Science ··························· 1
1.1.1 Background of Data Science ··················· 1
1.1.2 Definition of Data Science ···················· 1
1.2 Key Technologies of Data Science ····················· 3
1.2.1 Data Storage and Computing ····················· 5
1.2.2 Data Governance ························· 12
1.2.3 Structured Data Analysis ················ 28
1.2.4 Speech Analysis ························· 44
1.2.5 Visual Analysis ·························· 55
1.2.6 Text Analysis ·························· 61
1.2.7 Knowledge Graphs ························· 65
1.3 Chapter Summary ·································· 65
1.4 Exercises ········································ 66
1.5 References ···························· 66
Chapter 2 Pre-trained Text Models ······················ 68
2.1 History of Text Analysis Technology ················ 68
2.2 The Transformer Model Architecture ·················· 70
2.3 Pre-trained Model Architectures and Variants ·············· 75
2.4 GPU and TPU Accelerators ················ 79
2.4.1 Introduction to GPUs ······················ 79
2.4.2 GPU Product Naming ··················· 80
2.4.3 Differences Between TPUs and GPUs ·············· 83
2.4.4 Summary of TPU Usage ················· 84
2.5 Common Questions About Pre-trained Models ················· 87
2.5.1 Common Questions About Model Input ··········· 87
2.5.2 Common Questions About Model Principles ··········· 90
2.5.3 Common Questions About Model Evolution ··········· 94
2.6 Pre-trained Model Source Code Walkthrough ················ 96
2.6.1 Model Architecture ························· 96
2.6.2 BertModel ························ 96
2.6.3 BERT Pre-training Tasks ·············· 107
2.6.4 BERT Fine-tuning ······················ 112
2.7 Chapter Summary ································· 114
2.8 Exercises ······································· 114
2.9 References ··························· 115
Chapter 3 Multilingual Text Analysis ····················· 116
3.1 Background of Multilingual Text Analysis ············· 116
3.2 Multilingual Text Analysis Techniques ··················· 116
3.2.1 Polyglot ···················· 116
3.2.2 Multilingual BERT ············ 117
3.2.3 The XLM Multilingual Model ··············· 117
3.2.4 The XLMR Multilingual Model ············· 119
3.2.5 Experimental Results ·················· 120
3.3 Multilingual Text Analysis Source Code Walkthrough ············· 121
3.4 Chapter Summary ································· 125
3.5 Exercises ······································· 126
3.6 References ··························· 126
Chapter 4 Text Sentiment Analysis ························ 127
4.1 Background of Sentiment Analysis ····················· 127
4.2 Sentiment Analysis Techniques ··························· 127
4.2.1 Goals and Challenges ····················· 127
4.2.2 Technology Evolution ·················· 129
4.2.3 Requirements Analysis for Sentiment Analysis ·········· 133
4.2.4 Sentiment Analysis in Production ·········· 134
4.2.5 Building a Model Development Platform ·········· 137
4.3 Sentiment Analysis Competitions and Solutions ·················· 144
4.3.1 Background ························ 144
4.3.2 Solution Overview ························ 146
4.3.3 Data Cleaning and Augmentation ··············· 147
4.3.4 Multimodal Fusion ····················· 147
4.3.5 Machine Learning Tricks ·················· 148
4.4 Sentiment Analysis Source Code Walkthrough ····················· 151
4.4.1 Code for the F1-score Optimization Trick ······· 151
4.4.2 Adversarial Training Code ·················· 152
4.5 Chapter Summary ································· 154
4.6 Exercises ······································· 154
4.7 References ··························· 155
Chapter 5 Text Machine Translation ······················· 156
5.1 Background of Machine Translation ····················· 156
5.2 Machine Translation Techniques ··························· 157
5.2.1 Rule-based Machine Translation ·········· 157
5.2.2 Statistical Machine Translation ·················· 158
5.2.3 Neural Machine Translation ············ 159
5.2.4 The Encoder-Decoder Model ········· 161
5.2.5 Attention Mechanism Models ··············· 162
5.2.6 Industrial-scale Neural Network Practice ·········· 164
5.3 Machine Translation Competitions and Solutions ·················· 167
5.3.1 The WMT21 Translation Task ·············· 167
5.3.2 The WMT22 Translation Task ············· 168
5.4 Machine Translation Source Code Walkthrough ····················· 169
5.4.1 General Framework Overview ·················· 169
5.4.2 Translation Model Implementation ·················· 170
5.5 Chapter Summary ································· 180
5.6 Exercises ······································· 181
5.7 References ··························· 181
Chapter 6 Intelligent Text Correction ······················· 183
6.1 Background of Text Correction ····················· 183
6.2 Intelligent Text Correction Techniques ····················· 184
6.2.1 Significance and Challenges of Intelligent Correction ······· 185
6.2.2 Problems Solved by Intelligent Correction ·········· 185
6.2.3 Mainstream Industry Solutions ············ 186
6.2.4 Technical Solution Practice ·················· 190
6.3 Text Correction Competition ···················· 193
6.3.1 Competition Overview ························ 193
6.3.2 Thoughts on Proofreading Problems ·················· 194
6.4 Correction Solutions and Source Code Walkthrough ·················· 195
6.4.1 GECToR Explained ············· 195
6.4.2 MacBERT Explained ··········· 199
6.4.3 PERT Explained ················· 200
6.4.4 PLOME Explained ·············· 202
6.4.5 Competition Solution ························ 203
6.5 Chapter Summary ································· 204
6.6 Exercises ······································· 205
6.7 References ··························· 205
Chapter 7 Knowledge Graph Construction ······················ 206
7.1 Background of Knowledge Graphs ····················· 206
7.1.1 Knowledge and Knowledge Graphs ················ 206
7.1.2 Differences Among Knowledge Acquisition, Knowledge Extraction, and Information Extraction ···· 207
7.1.3 Knowledge Graph Construction Paradigms ············· 208
7.2 Unstructured Information Extraction Techniques ··············· 211
7.2.1 Information Extraction Framework ··················· 211
7.2.2 Named Entity Recognition ··················· 212
7.2.3 Relation Recognition ························ 213
7.2.4 Event Extraction ························ 215
7.3 Generative Unified-Model Extraction Techniques ············ 216
7.4 Model Source Code Walkthrough ··························· 220
7.5 Chapter Summary ································· 224
7.6 Exercises ······································· 224
7.7 References ··························· 225
Chapter 8 Knowledge Graph Question Answering ······················ 226
8.1 Background ································· 226
8.2 Knowledge Graph Question Answering Techniques ····················· 229
8.2.1 Information Retrieval Methods ·················· 229
8.2.2 Semantic Parsing Methods ·················· 231
8.3 Solutions and Source Code Walkthrough ························ 233
8.3.1 NL2SPARQL ··················· 233
8.3.2 NL2SPARQL Semantic Parsing Solution ··· 234
8.3.3 Overview of the T5, BART, and UniLM Models ··· 234
8.3.4 T5, BART, and UniLM Solutions ······ 236
8.3.5 Training T5, BART, and UniLM Generative Models ······················· 237
8.3.6 Semantic Ranking Solution and Code ·········· 239
8.3.7 SPARQL Correction Code ············· 241
8.4 Chapter Summary ································· 245
8.5 Exercises ······································· 245
Chapter 9 NL2SQL Question Answering over Structured Knowledge ········· 246
9.1 Background of NL2SQL ······················ 246
9.2 NL2SQL Techniques ··························· 249
9.2.1 NL2SQL Technical Approaches ············· 249
9.2.2 NL2SQL Project Practice ············· 255
9.3 NL2SQL Competitions and Solutions ··················· 256
9.4 NL2SQL Source Code Walkthrough ······················ 259
9.5 Chapter Summary ································· 269
9.6 Exercises ······································· 269
9.7 References ··························· 270
Chapter 10 The ChatGPT Large Language Model ············· 271
10.1 Introduction to ChatGPT ·························· 271
10.1.1 Definition and Background of ChatGPT ······ 271
10.1.2 Development History of ChatGPT ········· 272
10.2 Overview of GPT Models ·························· 272
10.2.1 Principles of the GPT-1 Model ············ 272
10.2.2 Principles of the GPT-2 Model ············ 273
10.2.3 Principles of the GPT-3 Model ············ 275
10.3 How ChatGPT Is Implemented ················· 277
10.3.1 Fine-tuning Techniques for Large Models ··········· 277
10.3.2 Sources of ChatGPT's Capabilities ········ 278
10.3.3 ChatGPT Pre-training and Fine-tuning ··· 279
10.4 ChatGPT Applications ······················· 282
10.4.1 ChatGPT Prompt Engineering ··········· 282
10.4.2 ChatGPT Application Scenarios ··········· 283
10.4.3 Strengths and Weaknesses of ChatGPT ··········· 284
10.5 Open-source Large Models ···························· 285
10.5.1 The ChatGLM Large Model ············· 285
10.5.2 The LLaMA Large Model ················ 288
10.6 Chapter Summary ································ 294
10.7 Exercises ······································ 294
10.8 References ·························· 295
Chapter 11 Industry Practice Cases ····················· 296
11.1 Smart Government Case Study ···················· 296
11.1.1 Case Background ······················· 296
11.1.2 Solution ······················· 297
11.1.3 System Architecture and Implementation ·············· 299
11.1.4 Case Summary ······················· 307
11.2 Public Security Case Study ···················· 308
11.2.1 Case Background ······················· 308
11.2.2 Solution ······················· 309
11.2.3 System Architecture and Implementation ·············· 311
11.2.4 Case Summary ······················· 317
11.3 Intelligent Emergency Response Case Study ···················· 318
11.3.1 Case Background ······················· 319
11.3.2 Solution ······················· 320
11.3.3 System Architecture and Implementation ·············· 321
11.3.4 Case Summary ······················· 332
11.4 Chapter Summary ································ 334
11.5 Exercises ······································ 334