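The Art of Spark Kernel Design: Architecture Design and Implementation (《Spark核心設計的藝術:架構設計與實現》) is jointly recommended by a number of experts and written by a big data specialist at 360. Based on Spark 2.1.0, it dissects the essentials of Spark's architecture and implementation. The analysis goes down to the method level and is distilled into numerous flowcharts, giving a rounded view of seven core design areas: architecture, environment, scheduling, storage, computation, deployment, and API.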
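Features of the book:
The book is organized the way one would naturally read source code: it starts with script analysis, moves on to initialization, and then reaches the core content, progressing from simple to complex throughout. Each chapter opens with an overview of its content, then analyzes the implementation of each component in depth, and finally shows how the components relate to one another through the execution flow. Wherever possible the book uses diagrams to explain the underlying principles, so that readers can absorb the material faster. Many of the implementations and principles covered are worth borrowing, and can help readers improve their skills in architecture design and programming. The book keeps as much source code as possible, so that beginners can read it comfortably away from the office (for example on the subway or a bus).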
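Basic information
- Title: Spark核心設計的藝術 (The Art of Spark Kernel Design)
- Also known as: Spark核心設計的藝術:架構設計與實現 (The Art of Spark Kernel Design: Architecture Design and Implementation)
- Author: Geng Jia'an (耿嘉安)
- ISBN: 978-7-111-58439-1
- Pages: 690
- Price: 139
- Publisher: China Machine Press (機械工業出版社)
- Publication date: 2018-01-01
- Binding: Paperback
- Format: 16mo
- Technical category: Big data
- Foreign-language title: The Art of Spark Kernel Design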
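Synopsis
Based on Spark 2.1.0, The Art of Spark Kernel Design: Architecture Design and Implementation dissects the essentials of Spark's architecture and implementation, with the aim of providing principled guidance for optimizing, customizing, and extending Spark.
The book has 10 chapters, divided into the following parts.
Preparation (Chapters 1-2): a brief introduction to setting up a Spark environment and to Spark's basic principles. Its detailed descriptions lower the barrier to entering the Spark world and give readers a high-level view of Spark's background and overall design.
Foundations (Chapters 3-5): covers Spark's infrastructure (configuration, RPC, metrics, and so on), the initialization of SparkContext, and the environment Spark needs in order to execute. After this part, readers will have a deep understanding of the design of the RPC framework and the functions of the execution environment, which is a prerequisite for understanding the core content.
Core (Chapters 6-9): the most central part of Spark, including the storage system, the scheduling system, the computing engine, and the deployment modes. After this part, readers will understand the details of Spark's data processing system well enough to extend Spark's core functionality, tune performance, and pinpoint problems in production.
API (Chapter 10): compares Spark's old and new APIs and gives a brief introduction to the new API.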
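About the author
Geng Jia'an has more than 10 years of experience in the IT industry. He has worked at Alibaba, eLong, and 360, focusing on open source and big data. Through extensive hands-on work he has studied J2EE, the JVM, Tomcat, Spring, Hadoop, Spark, MySQL, and Redis in depth, and he particularly enjoys dissecting the source code of open source projects. Early in his career he worked on J2EE enterprise application development and developed distinctive insights into Java-related technologies. He is also the author of Deep Understanding of Spark: Core Ideas and Source Code Analysis (《深入理解Spark:核心思想與源碼分析》).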
Table of contents
Praise for the book
Preface
Chapter 1 Environment Preparation
1.1 Preparing the runtime environment
1.1.1 Installing the JDK
1.1.2 Installing Scala
1.1.3 Installing Spark
1.2 A first taste of Spark
1.2.1 Running spark-shell
1.2.2 Running word count
1.2.3 Dissecting spark-shell
1.3 Preparing the source-reading environment
1.3.1 Installing SBT
1.3.2 Installing Git
1.3.3 Installing the Eclipse Scala IDE plugin
1.4 Compiling and debugging the Spark source code
1.5 Summary
Chapter 2 Design Philosophy and Basic Architecture
2.1 Getting to know Spark
2.1.1 Limitations of Hadoop MRv1
2.1.2 Features of Spark
2.1.3 Spark use cases
2.2 Spark fundamentals
2.3 Basic design ideas of Spark
2.3.1 Spark module design
2.3.2 Spark model design
2.4 Basic architecture of Spark
2.5 Summary
Chapter 3 Spark Infrastructure
3.1 Spark configuration
3.1.1 Configuration in system properties
3.1.2 Configuring through the SparkConf API
3.1.3 Cloning a SparkConf configuration
3.2 Spark's built-in RPC framework
3.2.1 RPC configuration: TransportConf
3.2.2 RPC client factory: TransportClientFactory
3.2.3 RPC server: TransportServer
3.2.4 Pipeline initialization
3.2.5 TransportChannelHandler in detail
3.2.6 The server-side RpcHandler in detail
3.2.7 Server bootstrap: TransportServerBootstrap
3.2.8 The client TransportClient in detail
3.3 Event bus
3.3.1 The ListenerBus inheritance hierarchy
3.3.2 SparkListenerBus in detail
3.3.3 LiveListenerBus in detail
3.4 Metrics system
3.4.1 The Source inheritance hierarchy
3.4.2 The Sink inheritance hierarchy
3.5 Summary
Chapter 4 SparkContext Initialization
4.1 SparkContext overview
4.2 Creating the Spark environment
4.3 Implementation of SparkUI
4.3.1 SparkUI overview
4.3.2 The WebUI framework
4.3.3 Creating SparkUI
4.4 Creating the heartbeat receiver
4.5 Creating and starting the scheduling system
4.6 Initializing the block manager BlockManager
4.7 Starting the metrics system
4.8 Creating the event log listener
4.9 Creating and starting ExecutorAllocationManager
4.10 Creating and starting ContextCleaner
4.10.1 Creating ContextCleaner
4.10.2 Starting ContextCleaner
4.11 Additional SparkListeners and starting the event bus
4.12 Updating the Spark environment
4.13 Finishing SparkContext initialization
4.14 Common methods provided by SparkContext
4.15 The SparkContext companion object
4.16 Summary
Chapter 5 Spark Execution Environment
5.1 SparkEnv overview
5.2 The security manager SecurityManager
5.3 RPC environment
5.3.1 RPC endpoint: RpcEndpoint
5.3.2 RPC endpoint reference: RpcEndpointRef
5.3.3 Creating the transport context TransportConf
5.3.4 The message dispatcher Dispatcher
5.3.5 Creating the transport context TransportContext
5.3.6 Creating the transport client factory TransportClientFactory
5.3.7 Creating TransportServer
5.3.8 Sending client requests
5.3.9 Common methods in NettyRpcEnv
5.4 The serializer manager SerializerManager
5.5 The broadcast manager BroadcastManager
5.6 Map task output tracker
5.6.1 Implementation of MapOutputTracker
5.6.2 How MapOutputTrackerMaster is implemented
5.7 Building the storage system
5.8 Creating the metrics system
5.8.1 MetricsConfig in detail
5.8.2 Common methods in MetricsSystem
5.8.3 Starting MetricsSystem
5.9 Output commit coordinator
5.9.1 Implementation of OutputCommitCoordinatorEndpoint
5.9.2 Implementation of OutputCommitCoordinator
5.9.3 How OutputCommitCoordinator works
5.10 Creating SparkEnv
5.11 Summary
Chapter 6 Storage System
6.1 Storage system overview
6.1.1 Storage system architecture
6.1.2 Basic concepts
6.2 Block information manager
6.2.1 Basic concepts of block locks
6.2.2 Implementation of block locks
6.3 Disk block manager
6.3.1 Local directory structure
6.3.2 Methods provided by DiskBlockManager
6.4 Disk storage: DiskStore
6.5 Memory manager
6.5.1 The memory pool model
6.5.2 StorageMemoryPool in detail
6.5.3 The MemoryManager model
6.5.4 UnifiedMemoryManager in detail
6.6 Memory storage: MemoryStore
6.6.1 The memory model of MemoryStore
6.6.2 Methods provided by MemoryStore
6.7 The block manager BlockManager
6.7.1 Initialization of BlockManager
6.7.2 Methods provided by BlockManager
6.8 How BlockManagerMaster manages BlockManager
6.8.1 Responsibilities of BlockManagerMaster
6.8.2 BlockManagerMasterEndpoint in detail
6.8.3 BlockManagerSlaveEndpoint in detail
6.9 Block transfer service
6.9.1 Initializing NettyBlockTransferService
6.9.2 NettyBlockRpcServer in detail
6.9.3 The shuffle client
6.10 DiskBlockObjectWriter in detail
6.11 Summary
Chapter 7 Scheduling System
7.1 Scheduling system overview
7.2 RDD in detail
7.2.1 Why RDDs are needed
7.2.2 A first analysis of the RDD implementation
7.2.3 RDD dependencies
7.2.4 The partitioner Partitioner
7.2.5 RDDInfo
7.3 Stage in detail
7.3.1 Implementation of ResultStage
7.3.2 Implementation of ShuffleMapStage
7.3.3 StageInfo
7.4 The DAG-oriented scheduler DAGScheduler
7.4.1 JobListener and JobWaiter
7.4.2 ActiveJob in detail
7.4.3 A brief introduction to DAGSchedulerEventProcessLoop
7.4.4 Components of DAGScheduler
7.4.5 Common methods provided by DAGScheduler
7.4.6 DAGScheduler and job submission
7.4.7 Building stages
7.4.8 Submitting the ResultStage
7.4.9 Submitting tasks not yet computed
7.4.10 The scheduling flow of DAGScheduler
7.4.11 Handling task execution results
7.5 The scheduling pool Pool
7.5.1 Scheduling algorithms
7.5.2 Implementation of Pool
7.5.3 Scheduling pool builders
7.6 The task set manager TaskSetManager
7.6.1 Task sets
7.6.2 Member fields of TaskSetManager
7.6.3 Scheduling pools and speculative execution
7.6.4 Task locality
7.6.5 Common methods of TaskSetManager
7.7 The launcher backend interface LauncherBackend
7.7.1 Implementation of BackendConnection
7.7.2 Implementation of LauncherBackend
7.8 The scheduler backend interface SchedulerBackend
7.8.1 Definition of SchedulerBackend
7.8.2 Analysis of the LocalSchedulerBackend implementation
7.9 The task result getter TaskResultGetter
7.9.1 Handling successful tasks
7.9.2 Handling failed tasks
7.10 The task scheduler TaskScheduler
7.10.1 Properties of TaskSchedulerImpl
7.10.2 Initialization of TaskSchedulerImpl
7.10.3 Starting TaskSchedulerImpl
7.10.4 TaskSchedulerImpl and task submission
7.10.5 TaskSchedulerImpl and resource allocation
7.10.6 The scheduling flow of TaskSchedulerImpl
7.10.7 How TaskSchedulerImpl handles execution results
7.10.8 Common methods of TaskSchedulerImpl
7.11 Summary
Chapter 8 Computing Engine
8.1 Computing engine overview
8.2 Memory manager and execution memory
8.2.1 ExecutionMemoryPool in detail
8.2.2 The MemoryManager model and execution memory
8.2.3 UnifiedMemoryManager and execution memory
8.3 Memory manager and Tungsten
8.3.1 MemoryBlock in detail
8.3.2 The MemoryManager model and Tungsten
8.3.3 Tungsten's memory allocators
8.4 Task memory manager
8.4.1 TaskMemoryManager in detail
8.4.2 Memory consumers
8.4.3 Overall architecture of execution memory
8.5 Task in detail
8.5.1 The task context TaskContext
8.5.2 Definition of Task
8.5.3 Implementation of ShuffleMapTask
8.5.4 Implementation of ResultTask
8.6 IndexShuffleBlockResolver in detail
8.7 Sampling and estimation
8.7.1 Analysis of the SizeTracker implementation
8.7.2 How SizeTracker works
8.8 The trait WritablePartitionedPairCollection
8.9 Analysis of the AppendOnlyMap implementation
8.9.1 Capacity growth of AppendOnlyMap
8.9.2 Data updates in AppendOnlyMap
8.9.3 The cache aggregation algorithm of AppendOnlyMap
8.9.4 Built-in sorting in AppendOnlyMap
8.9.5 Extensions of AppendOnlyMap
8.10 Analysis of the PartitionedPairBuffer implementation
8.10.1 Capacity growth of PartitionedPairBuffer
8.10.2 Insertion into PartitionedPairBuffer
8.10.3 The iterator of PartitionedPairBuffer
8.11 External sorters
8.11.1 ExternalSorter in detail
8.11.2 ShuffleExternalSorter in detail
8.12 Shuffle manager
8.12.1 ShuffleWriter in detail
8.12.2 ShuffleBlockFetcherIterator in detail
8.12.3 BlockStoreShuffleReader in detail
8.12.4 SortShuffleManager in detail
8.13 Combining map-side and reduce-side shuffle
8.14 Summary
Chapter 9 Deployment Modes
9.1 The heartbeat receiver HeartbeatReceiver
9.2 Analysis of the Executor implementation
9.2.1 Executor heartbeat reporting
9.2.2 Running tasks
9.3 local deployment mode
9.4 The persistence engine PersistenceEngine
9.4.1 File-system-based persistence engine
9.4.2 ZooKeeper-based persistence engine
9.5 Leader election agent
9.6 Master in detail
9.6.1 Starting the Master
9.6.2 Checking for Worker timeouts
9.6.3 Handling being elected leader
9.6.4 First-level resource scheduling
9.6.5 Registering Workers
9.6.6 Updating the latest state of a Worker
9.6.7 Handling Worker heartbeats
9.6.8 Registering Applications
9.6.9 Handling Executor requests
9.6.10 Handling Executor state changes
9.6.11 Common methods of Master
9.7 Worker in detail
9.7.1 Starting the Worker
9.7.2 Registering the Worker with the Master
9.7.3 Sending heartbeats to the Master
9.7.4 Worker and leader election
9.7.5 Running the Driver
9.7.6 Running Executors
9.7.7 Handling Executor state changes
9.8 Implementation of StandaloneAppClient
9.8.1 Analysis of the ClientEndpoint implementation
9.8.2 Analysis of the StandaloneAppClient implementation
9.9 Analysis of the StandaloneSchedulerBackend implementation
9.9.1 Properties of StandaloneSchedulerBackend
9.9.2 Analysis of the DriverEndpoint implementation
9.9.3 Starting StandaloneSchedulerBackend
9.9.4 Stopping StandaloneSchedulerBackend
9.9.5 StandaloneSchedulerBackend and resource allocation
9.10 CoarseGrainedExecutorBackend in detail
9.10.1 The CoarseGrainedExecutorBackend process
9.10.2 Functional analysis of CoarseGrainedExecutorBackend
9.11 local-cluster deployment mode
9.11.1 Starting the local cluster
9.11.2 Startup process of local-cluster deployment mode
9.11.3 Executor allocation in local-cluster deployment mode
9.11.4 Task submission and execution in local-cluster deployment mode
9.12 Standalone deployment mode
9.12.1 Startup process of Standalone deployment mode
9.12.2 Executor allocation in Standalone deployment mode
9.12.3 Resource reclamation in Standalone deployment mode
9.12.4 Fault tolerance in Standalone deployment mode
9.13 Other deployment options
9.13.1 YARN
9.13.2 Mesos
9.14 Summary
Chapter 10 Spark API
10.1 Basic concepts
10.2 The data source DataSource
10.2.1 DataSourceRegister in detail
10.2.2 DataSource in detail
10.3 Implementation of checkpoints
10.3.1 Implementation of CheckpointRDD
10.3.2 Implementation of RDDCheckpointData
10.3.3 Implementation of ReliableRDDCheckpointData
10.4 RDD revisited
10.4.1 Transformation APIs
10.4.2 Action APIs
10.4.3 Analysis of the checkpoint API implementation
10.4.4 Iterative computation
10.5 The data collection Dataset
10.6 DataFrameReader in detail
10.7 SparkSession in detail
10.7.1 The SparkSession builder Builder
10.7.2 The SparkSession API
10.8 A word count example
10.8.1 Job preparation stage
10.8.2 Job submission and scheduling
10.9 Summary
Appendix