Spark Data Analysis: Based on the Python Language (English Edition)

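Synopsis

This book focuses on the fundamentals of the Spark project. It starts with the Spark core, then extends to the various Spark extensions, Spark-related projects, and Spark subprojects, as well as other open source technologies in the rich ecosystem in which Spark operates, such as Hadoop, Kafka, and Cassandra.

Table of Contents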

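Part I Spark Foundations

Chapter 1 Introducing Big Data, Hadoop, and Spark 3

1.1 Introduction to Big Data, Distributed Computing, and Hadoop 3

1.1.1 A Brief History of Big Data and Hadoop 4

1.1.2 Hadoop Explained 5

1.2 Introduction to Apache Spark 11

1.2.1 Apache Spark Background 11

1.2.2 Uses for Spark 12

1.2.3 Programming Interfaces to Spark 12

1.2.4 Submission Types for Spark Programs 12

1.2.5 Input/Output Types for Spark Applications 14

1.2.6 The Spark RDD 14

1.2.7 Spark and Hadoop 14

1.3 Functional Programming Using Python 15

1.3.1 Data Structures Used in Functional Python Programming 15

1.3.2 Python Object Serialization 18

1.3.3 Python Functional Programming Basics 21

1.4 Summary 23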

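Chapter 2 Deploying Spark 25

2.1 Spark Deployment Modes 25

2.1.1 Local Mode 26

2.1.2 Spark Standalone 26

2.1.3 Spark on YARN 27

2.1.4 Spark on Mesos 28

2.2 Preparing to Install Spark 28

2.3 Getting Spark 29

2.4 Installing Spark on Linux or Mac OS X 30

2.5 Installing Spark on Windows 32

2.6 Exploring the Spark Installation 34

2.7 Deploying a Multi-Node Spark Standalone Cluster 35

2.8 Deploying Spark in the Cloud 37

2.8.1 AWS 37

2.8.2 GCP 39

2.8.3 Databricks 40

2.9 Summary 41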

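Chapter 3 Understanding the Spark Cluster Architecture 43

3.1 Anatomy of a Spark Application 43

3.1.1 Spark Driver 44

3.1.2 Spark Workers and Executors 47

3.1.3 The Spark Master and Cluster Manager 49

3.2 Spark Applications Using the Standalone Scheduler 51

3.3 Deployment Modes for Spark Applications Running on YARN 51

3.3.1 Client Mode 52

3.3.2 Cluster Mode 53

3.3.3 Local Mode Revisited 54

3.4 Summary 55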

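Chapter 4 Learning Spark Programming Basics 57

4.1 Introduction to RDDs 57

4.2 Loading Data into RDDs 59

4.2.1 Creating an RDD from a File or Files 59

4.2.2 Methods for Creating RDDs from a Text File or Files 61

4.2.3 Creating an RDD from an Object File 64

4.2.4 Creating an RDD from a Data Source 64

4.2.5 Creating RDDs from JSON Files 67

4.2.6 Creating an RDD Programmatically 69

4.3 Operations on RDDs 70

4.3.1 Key RDD Concepts 70

4.3.2 Basic RDD Transformations 75

4.3.3 Basic RDD Actions 79

4.3.4 Transformations on PairRDDs 83

4.3.5 MapReduce and Word Count Exercise 90

4.3.6 Join Transformations 93

4.3.7 Joining Datasets in Spark 98

4.3.8 Transformations on Sets 101

4.3.9 Transformations on Numeric RDDs 103

4.4 Summary 106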

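Part II Beyond the Basics

Chapter 5 Advanced Programming Using the Spark Core API 109

5.1 Shared Variables in Spark 109

5.1.1 Broadcast Variables 110

5.1.2 Accumulators 114

5.1.3 Exercise: Using Broadcast Variables and Accumulators 117

5.2 Partitioning Data in Spark 118

5.2.1 Partitioning Overview 118

5.2.2 Controlling Partitions 119

5.2.3 Repartitioning Functions 121

5.2.4 Partition-Specific or Partition-Aware API Methods 123

5.3 RDD Storage Options 125

5.3.1 RDD Lineage Revisited 125

5.3.2 RDD Storage Options 126

5.3.3 RDD Caching 129

5.3.4 Persisting RDDs 129

5.3.5 Choosing When to Persist or Cache RDDs 132

5.3.6 Checkpointing RDDs 132

5.3.7 Exercise: Checkpointing RDDs 134

5.4 Processing RDDs with External Programs 136

5.5 Data Sampling with Spark 137

5.6 Understanding Spark Application and Cluster Configuration 139

5.6.1 Spark Environment Variables 139

5.6.2 Spark Configuration Properties 143

5.7 Optimizing Spark 146

5.7.1 Filter Early, Filter Often 147

5.7.2 Optimizing Associative Operations 147

5.7.3 Understanding the Impact of Functions and Closures 149

5.7.4 Considerations for Collecting Data 150

5.7.5 Configuration Parameters for Tuning and Optimizing Applications 150

5.7.6 Avoiding Inefficient Partitioning 151

5.7.7 Diagnosing Application Performance Issues 153

5.8 Summary 157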

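Chapter 6 SQL and NoSQL Programming with Spark 159

6.1 Introduction to Spark SQL 159

6.1.1 Introduction to Hive 160

6.1.2 Spark SQL Architecture 164

6.1.3 Getting Started with DataFrames 166

6.1.4 Using DataFrames 177

6.1.5 Caching, Persisting, and Repartitioning DataFrames 185

6.1.6 Saving DataFrame Output 186

6.1.7 Accessing Spark SQL 189

6.1.8 Exercise: Using Spark SQL 192

6.2 Using Spark with NoSQL Systems 193

6.2.1 Introduction to NoSQL 194

6.2.2 Using Spark with HBase 195

6.2.3 Exercise: Using Spark with HBase 198

6.2.4 Using Spark with Cassandra 200

6.2.5 Using Spark with DynamoDB 202

6.2.6 Other NoSQL Platforms 204

6.3 Summary 204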

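Chapter 7 Stream Processing and Messaging Using Spark 207

7.1 Introducing Spark Streaming 207

7.1.1 Spark Streaming Architecture 208

7.1.2 Introduction to DStreams 209

7.1.3 Exercise: Getting Started with Spark Streaming 216

7.1.4 State Operations 217

7.1.5 Sliding Window Operations 219

7.2 Structured Streaming 221

7.2.1 Structured Streaming Data Sources 222

7.2.2 Structured Streaming Data Sinks 223

7.2.3 Output Modes 224

7.2.4 Structured Streaming Operations 225

7.3 Using Spark with Messaging Platforms 226

7.3.1 Apache Kafka 227

7.3.2 Exercise: Using Spark with Kafka 232

7.3.3 Amazon Kinesis 235

7.4 Summary 238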

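Chapter 8 Introduction to Data Science and Machine Learning Using Spark 241

8.1 Spark and R 241

8.1.1 Introduction to R 242

8.1.2 Using Spark with R 248

8.1.3 Exercise: Using RStudio with SparkR 255

8.2 Machine Learning with Spark 257

8.2.1 Machine Learning Primer 257

8.2.2 Machine Learning Using Spark MLlib 260

8.2.3 Exercise: Implementing a Recommender Using Spark MLlib 265

8.2.4 Machine Learning Using Spark ML 269

8.3 Using Notebooks with Spark 273

8.3.1 Using Jupyter (IPython) Notebooks with Spark 273

8.3.2 Using Apache Zeppelin Notebooks with Spark 276

8.4 Summary 277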

Contents

I: Spark Foundations

1 Introducing Big Data, Hadoop, and Spark 3

Introduction to Big Data, Distributed Computing, and Hadoop 3

A Brief History of Big Data and Hadoop 4

Hadoop Explained 5

Introduction to Apache Spark 11

Apache Spark Background 11

Uses for Spark 12

Programming Interfaces to Spark 12

Submission Types for Spark Programs 12

Input/Output Types for Spark Applications 14

The Spark RDD 14

Spark and Hadoop 14

Functional Programming Using Python 15

Data Structures Used in Functional Python Programming 15

Python Object Serialization 18

Python Functional Programming Basics 21

Summary 23

2 Deploying Spark 25

Spark Deployment Modes 25

Local Mode 26

Spark Standalone 26

Spark on YARN 27

Spark on Mesos 28

Preparing to Install Spark 28

Getting Spark 29

Installing Spark on Linux or Mac OS X 30

Installing Spark on Windows 32

Exploring the Spark Installation 34

Deploying a Multi-Node Spark Standalone Cluster 35

Deploying Spark in the Cloud 37

Amazon Web Services (AWS) 37

Google Cloud Platform (GCP) 39

Databricks 40

Summary 41

3 Understanding the Spark Cluster Architecture 43

Anatomy of a Spark Application 43

Spark Driver 44

Spark Workers and Executors 47

The Spark Master and Cluster Manager 49

Spark Applications Using the Standalone Scheduler 51

Deployment Modes for Spark Applications Running on YARN 51

Client Mode 52

Cluster Mode 53

Local Mode Revisited 54

Summary 55

4 Learning Spark Programming Basics 57

Introduction to RDDs 57

Loading Data into RDDs 59

Creating an RDD from a File or Files 59

Methods for Creating RDDs from a Text File or Files 61

Creating an RDD from an Object File 64

Creating an RDD from a Data Source 64

Creating RDDs from JSON Files 67

Creating an RDD Programmatically 69

Operations on RDDs 70

Key RDD Concepts 70

Basic RDD Transformations 75

Basic RDD Actions 79

Transformations on PairRDDs 83

MapReduce and Word Count Exercise 90

Join Transformations 93

Joining Datasets in Spark 98

Transformations on Sets 101

Transformations on Numeric RDDs 103

Summary 106

II: Beyond the Basics

5 Advanced Programming Using the Spark Core API 109

Shared Variables in Spark 109

Broadcast Variables 110

Accumulators 114

Exercise: Using Broadcast Variables and Accumulators 117

Partitioning Data in Spark 118

Partitioning Overview 118

Controlling Partitions 119

Repartitioning Functions 121

Partition-Specific or Partition-Aware API Methods 123

RDD Storage Options 125

RDD Lineage Revisited 125

RDD Storage Options 126

RDD Caching 129

Persisting RDDs 129

Choosing When to Persist or Cache RDDs 132

Checkpointing RDDs 132

Exercise: Checkpointing RDDs 134

Processing RDDs with External Programs 136

Data Sampling with Spark 137

Understanding Spark Application and Cluster Configuration 139

Spark Environment Variables 139

Spark Configuration Properties 143

Optimizing Spark 146

Filter Early, Filter Often 147

Optimizing Associative Operations 147

Understanding the Impact of Functions and Closures 149

Considerations for Collecting Data 150

Configuration Parameters for Tuning and Optimizing Applications 150

Avoiding Inefficient Partitioning 151

Diagnosing Application Performance Issues 153

Summary 157

6 SQL and NoSQL Programming with Spark 159

Introduction to Spark SQL 159

Introduction to Hive 160

Spark SQL Architecture 164

Getting Started with DataFrames 166

Using DataFrames 177

Caching, Persisting, and Repartitioning DataFrames 185

Saving DataFrame Output 186

Accessing Spark SQL 189

Exercise: Using Spark SQL 192

Using Spark with NoSQL Systems 193

Introduction to NoSQL 194

Using Spark with HBase 195

Exercise: Using Spark with HBase 198

Using Spark with Cassandra 200

Using Spark with DynamoDB 202

Other NoSQL Platforms 204

Summary 204

7 Stream Processing and Messaging Using Spark 207

Introducing Spark Streaming 207

Spark Streaming Architecture 208

Introduction to DStreams 209

Exercise: Getting Started with Spark Streaming 216

State Operations 217

Sliding Window Operations 219

Structured Streaming 221

Structured Streaming Data Sources 222

Structured Streaming Data Sinks 223

Output Modes 224

Structured Streaming Operations 225

Using Spark with Messaging Platforms 226

Apache Kafka 227

Exercise: Using Spark with Kafka 232

Amazon Kinesis 235

Summary 238

8 Introduction to Data Science and Machine Learning Using Spark 241

Spark and R 241

Introduction to R 242

Using Spark with R 248

Exercise: Using RStudio with SparkR 255

Machine Learning with Spark 257

Machine Learning Primer 257

Machine Learning Using Spark MLlib 260

Exercise: Implementing a Recommender Using Spark MLlib 265

Machine Learning Using Spark ML 269

Using Notebooks with Spark 273

Using Jupyter (IPython) Notebooks with Spark 273

Using Apache Zeppelin Notebooks with Spark 276

Summary 277
