書籍信息
作 者: [美]Pang-Ning Tan,Michael Steinbach,Vipin Kumar 著
出版時間: 2006-1-1
字 數: 713000
版 次: 1
頁 數: 516
紙 張: 膠版紙
I S B N : 9787115141446
包 裝: 平裝
所屬分類: 圖書 >> 計算機/網路 >> 資料庫 >> 數據倉庫與數據挖掘
定價:¥59.00
編輯推薦
本書全面介紹了數據挖掘,涵蓋了五個主題:數據、分類、
關聯分析、
聚類和
異常檢測。除異常檢測外,每個主題都有兩章:前一章涵蓋基本概念、代表性算法和評估技術,而後一章討論高級概念和算法。這樣讀者在透徹地理解數據挖掘的基礎的同時,還能夠了解更多重要的高級主題。
圖書特色
·與許多其他同類圖書不同,本書將重點放在如何用數據挖掘知識解決各種實際問題。
·只要求具備很少的預備知識——不需要資料庫背景,只需要很少的統計學或數學背景知識。
·書中包含大量的圖表、綜合示例和豐富的習題,並且使用示例、關鍵算法的簡潔描述和習題,儘可能直接地聚集於
數據挖掘的主要概念。
·教輔內容極為豐富,包括課程幻燈片、學生課題建議、數據挖掘資源(如
數據挖掘算法和數據集)、在線上指南(使用實際的數據集和數據分析軟體,為本書介紹的部分數據挖掘技術提供例子講解)。
·為採用本書作為教材的教師提供習題解答。
內容提要
本書對
數據挖掘進行了全面介紹,旨在為讀者提供將數據挖掘套用於實際問題所必需的知識。本書涵蓋五個主題:數據、分類、
關聯分析、
聚類和
異常檢測。除異常檢測外,每個主題都有兩章:前面一章講述基本概念、代表性算法和評估技術,而後面一章較深入地討論高級概念和算法。目的是在使讀者透徹地理解數據挖掘基礎的同時,還能了解更多重要的高級主題。此外,書中還提供了大量例子、圖表和習題。
本書適合作為相關專業高年級本科生和研究生數據挖掘課程的教材,同時也可作為從事數據挖掘研究和套用開發工作的技術人員的參考書。
作者介紹
Michael Steinbach 明尼蘇達大學計算機與工程系研究員,在讀博士。
Vipin Kumar 明尼蘇達大學計算機科學與工程系主任,曾任美國陸軍高性能計算研究中心主任。他擁有
馬里蘭大學博士學位,是數據挖掘和高性能計算方面的國際權威,IEEE會士。
章節目錄
1 Introduction 1
1.1 What Is Data Mining? 2
1.2 Motivating Challenges 3
1.3 The Origins of Data Mining 4
1.4 Data Mining Tasks 5
1.5 Scope and Organization of the Book 8
1.6 Bibliographic Notes 9
1.7 Exercises 12
2 Data 13
2.1 Types of Data 15
2.1.1 Attributes and Measurement 15
2.1.2 Types of Data Sets 20
2.2 Data Quality 25
2.2.1 Measurement and Data Collection Issues 26
2.2.2 Issues Related to Applications 31
2.3 Data Preprocessing 32
2.3.1 Aggregation 32
2.3.2 Sampling 34
2.3.3 Dimensionality Reduction 36
2.3.4 Feature Subset Selection 37
2.3.5 Feature Creation 39
2.3.6 Discretization and Binarization 41
2.3.7 Variable Transformation 45
2.4 Measures of Similarity and Dissimilarity 47
2.4.1 Basics 47
2.4.2 Similarity and Dissimilarity between Simple Attributes 49
2.4.3 Dissimilarities between Data Objects 50
2.4.4 Similarities between Data Objects 52
2.4.5 Examples of Proximity Measures 53
2.4.6 Issues in Proximity Calculation 58
2.4.7 Selecting the Right Proximity Measure 60
2.5 Bibliographic Notes 61
2.6 Exercises 64
3 Exploring Data 71
3.1 The Iris Data Set 71
3.2 Summary Statistics 72
3.2.1 Frequencies and the Mode 72
3.2.2 Percentiles 73
3.2.3 Measures of Location: Mean and Median 73
3.2.4 Measures of Spread: Range and Variance 75
3.2.5 Multivariate Summary Statistics 76
3.2.6 Other Ways to Summarize the Data 77
3.3 Visualization 77
3.3.1 Motivations for Visualization 77
3.3.2 General Concepts 78
3.3.3 Techniques 81
3.3.4 Visualizing Higher-Dimensional Data 90
3.3.5 Do's and Don'ts 94
3.4 OLAP and Multidimensional Data Analysis 95
3.4.1 Representing Iris Data as a Multidimensional Array 95
3.4.2 Multidimensional Data: The General Case 97
3.4.3 Analyzing Multidimensional Data 98
3.4.4 Final Comments on Multidimensional Data Analysis 101
3.5 Bibliographic Notes 102
3.6 Exercises 103
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation 105
4.1 Preliminaries 105
4.2 General Approach to Solving a Classification Problem 107
4.3 Decision Tree Induction 108
4.3.1 How a Decision Tree Works 108
4.3.2 How to Build a Decision Tree 110
4.3.3 Methods for Expressing Attribute Test Conditions 112
4.3.4 Measures for Selecting the Best Split 114
4.3.5 Algorithm for Decision Tree Induction 119
4.3.6 An Example: Web Robot Detection 120
4.3.7 Characteristics of Decision Tree Induction 122
4.4 Model Overfitting 125
4.4.1 Overfitting Due to Presence of Noise 127
4.4.2 Overfitting Due to Lack of Representative Samples 129
4.4.3 Overfitting and the Multiple Comparison Procedure 129
4.4.4 Estimation of Generalization Errors 131
4.4.5 Handling Overfitting in Decision Tree Induction 134
4.5 Evaluating the Performance of a Classifier 135
4.5.1 Holdout Method 136
4.5.2 Random Subsampling 136
4.5.3 Cross-Validation 136
4.5.4 Bootstrap 137
4.6 Methods for Comparing Classifiers 137
4.6.1 Estimating a Confidence Interval for Accuracy 138
4.6.2 Comparing the Performance of Two Models 139
4.6.3 Comparing the Performance of Two Classifiers 140
4.7 Bibliographic Notes 141
4.8 Exercises 144
5 Classification: Alternative Techniques 151
5.1 Rule-Based Classifier 151
5.1.1 How a Rule-Based Classifier Works 153
5.1.2 Rule-Ordering Schemes 154
5.1.3 How to Build a Rule-Based Classifier 155
5.1.4 Direct Methods for Rule Extraction 155
5.1.5 Indirect Methods for Rule Extraction 161
5.1.6 Characteristics of Rule-Based Classifiers 163
5.2 Nearest-Neighbor classifiers 163
5.2.1 Algorithm 165
5.2.2 Characteristics of Nearest-Neighbor Classifiers 165
5.3 Bayesian Classifiers 166
5.3.1 Bayes Theorem 166
5.3.2 Using the Bayes Theorem for Classification 168
5.3.3 Na?ve Bayes Classifier 169
5.3.4 Bayes Error Rate 175
5.3.5 Bayesian Belief Networks 176
5.4 Artificial Neural Network (ANN) 181
5.4.1 Perceptron 181
5.4.2 Multilayer Artificial Neural Network 184
5.4.3 Characteristics of ANN 187
5.5 Support Vector Machine (SVM) 188
5.5.1 Maximum Margin Hyperplanes 188
5.5.2 Linear SVM: Separable Case 190
5.5.3 Linear SVM: Nonseparable Case 195
5.5.4 Nonlinear SVM 198
5.5.5 Characteristics of SVM 203
5.6 Ensemble Methods 203
5.6.1 Rationale for Ensemble Method 203
5.6.2 Methods for Constructing an Ensemble Classifier 204
5.6.3 Bias-Variance Decomposition 206
5.6.4 Bagging 209
5.6.5 Boosting 211
5.6.6 Random Forests 215
5.6.7 Empirical Comparison among Ensemble Methods 216
5.7 Class Imbalance Problem 217
5.7.1 Alternative Metrics 218
5.7.2 The Receiver Operating Characteristic Curve 220
5.7.3 Cost-Sensitive Learning 223
5.7.4 Sampling-Based Approaches 225
5.8 Multiclass Problem 226
5.9 Bibliographic Notes 228
5.10 Exercises 233
6 Association Analysis: Basic Concepts and Algorithms 241
6.1 Problem Definition 242
6.2 Frequent Itemset Generation 244
6.2.1 The Apriori Principle 246
6.2.2 Frequent Itemset Generation in the Apriori Algorithm 247
6.2.3 Candidate Generation and Pruning 249
6.2.4 Support Counting 252
6.2.5 Computational Complexity 255
6.3 Rule Generation 257
6.3.1 Confidence-Based Pruning 258
6.3.2 Rule Generation in Apriori Algorithm 258
6.3.3 An Example: Congressional Voting Records 259
6.4 Compact Representation of Frequent Itemsets 260
6.4.1 Maximal Frequent Itemsets 260
6.4.2 Closed Frequent Itemsets 262
6.5 Alternative Methods for Generating Frequent Itemsets 264
6.6.2 Frequent Itemset Generation in FP-Growth Algorithm 270
6.7 Evaluation of Association Patterns 273
6.7.1 Objective Measures of Interestingness 274
6.7.2 Measures beyond Pairs of Binary Variables 282
6.7.3 Simpson's Paradox 283
6.8 Effect of Skewed Support Distribution 285
6.9 Bibliographic Notes 288
6.10 Exercises 298
7 Association Analysis: Advanced Concepts 307
7.1 Handling Categorical Attributes 307
7.2 Handling Continuous Attributes 309
7.2.1 Discretization-Based Methods 310
7.2.2 Statistics-Based Methods 312
7.2.3 Non-discretization Methods 314
7.3 Handling a Concept Hierarchy 316
7.4 Sequential Patterns 318
7.4.1 Problem Formulation 318
7.4.2 Sequential Pattern Discovery 320
7.4.3 Timing Constraints 323
7.4.4 Alternative Counting Schemes 327
7.5 Subgraph Patterns 328
7.5.1 Graphs and Subgraphs 329
7.5.2 Frequent Subgraph Mining 330
7.5.3 Apriori-like Method 332
7.5.4 Candidate Generation 333
7.5.5 Candidate Pruning 338
7.5.6 Support Counting 340
7.6 Infrequent Patterns 340
7.6.1 Negative Patterns 341
7.6.2 Negatively Correlated Patterns 342
7.6.3 Comparisons among Infrequent Patterns, Negative Patterns, and Negatively Correlated Patterns 343
7.6.4 Techniques for Mining Interesting Infrequent Patterns 344
7.6.5 Techniques Based on Mining Negative Patterns 345
7.6.6 Techniques Based on Support Expectation 347
7.7 Bibliographic Notes 350
7.8 Exercises 353
8 Cluster Analysis: Basic Concepts and Algorithms 363
8.1 Overview 365
8.1.1 What Is Cluster Analysis? 365
8.1.2 Different Types of Clusterings 366
8.1.3 Different Types of Clusters 368
8.2 K-means 370
8.2.1 The Basic K-means Algorithm 371
8.2.2 K-means: Additional Issues 378
8.2.3 Bisecting K-means 380
8.2.4 K-means and Different Types of Clusters 381
8.2.5 Strengths and Weaknesses 383
8.2.6 K-means as an Optimization Problem 383
8.3 Agglomerative Hierarchical Clustering 385
8.3.1 Basic Agglomerative Hierarchical Clustering Algorithm 385
8.3.2 Specific Techniques 387
8.3.3 The Lance-Williams Formula for Cluster Proximity 391
8.3.4 Key Issues in Hierarchical Clustering 391
8.3.5 Strengths and Weaknesses 393
8.4.1 Traditional Density: Center-Based Approach 393
8.4.2 The DBSCAN Algorithm 394
8.4.3 Strengths and Weaknesses 398
8.5 Cluster Evaluation 398
8.5.1 Overview 399
8.5.2 Unsupervised Cluster Evaluation Using Cohesion and Separation 401
8.5.3 Unsupervised Cluster Evaluation Using the Proximity Matrix 406
8.5.4 Unsupervised Evaluation of Hierarchical Clustering 408
8.5.5 Determining the Correct Number of Clusters 409
8.5.6 Clustering Tendency 410
8.5.7 Supervised Measures of Cluster Validity 411
8.5.8 Assessing the Significance of Cluster Validity Measures 414
8.6 Bibliographic Notes 416
8.7 Exercises 419
9 Cluster Analysis: Additional Issues and Algorithms 427
9.1 Characteristics of Data, Clusters, and Clustering Algorithms 427
9.1.1 Example: Comparing
K-means and DBSCAN 428
9.1.2 Data Characteristics 429
9.1.3 Cluster Characteristics 430
9.1.4 General Characteristics of Clustering Algorithms 431
9.2 Prototype-Based Clustering 433
9.2.1 Fuzzy Clustering 433
9.2.2 Clustering Using Mixture Models 437
9.2.3 Self-Organizing Maps (SOM) 446
9.3 Density-Based Clustering 451
9.3.1 Grid-Based Clustering 451
9.3.2 Subspace Clustering 454
9.3.3 DENCLUE: A Kernel-Based Scheme for Density-Based Clustering 457
9.4 Graph-Based Clustering 460
9.4.1 Sparsification 461
9.4.2 Minimum Spanning Tree (MST) Clustering 462
9.4.3 OPOSSUM: Optimal Partitioning of Sparse Similarities Using METIS 463
9.4.4 Chameleon: Hierarchical Clustering with Dynamic Modeling 464
9.4.5 Shared Nearest Neighbor Similarity 468
9.4.6 The Jarvis-Patrick Clustering Algorithm 471
9.4.7 SNN Density 472
9.4.8 SNN Density-Based Clustering 473
9.5 Scalable Clustering Algorithms 475
9.5.1 Scalability: General Issues and Approaches 476
9.5.2 BIRCH 477
9.5.3 CURE 479
9.6 Which Clustering Algorithm? 482
9.7 Bibliographic Notes 484
9.8 Exercises 488
10 Anomaly Detection 491
10.1 Preliminaries 492
10.1.1 Causes of Anomalies 492
10.1.2 Approaches to Anomaly Detection 493
10.1.3 The Use of Class Labels 494
10.1.4 Issues 495
10.2 Statistical Approaches 496
10.2.1 Detecting Outliers in a Univariate Normal Distribution 497
10.2.2 Outliers in a Multivariate Normal Distribution 499
10.2.3 A Mixture Model Approach for Anomaly Detection 500
10.2.4 Strengths and Weaknesses 502
10.3 Proximity-Based Outlier Detection 502
10.3.1 Strengths and Weaknesses 503
10.4 Density-Based Outlier Detection 504
10.4.1 Detection of Outliers Using Relative Density 505
10.4.2 Strengths and Weaknesses 506
10.5 Clustering-Based Techniques 506
10.5.1 Assessing the Extent to Which an Object Belongs to a Cluster 507
10.5.2 Impact of Outliers on the Initial Clustering 509
10.5.3 The Number of Clusters to Use 509
10.5.4 Strengths and Weaknesses 509
10.6 Bibliographic Notes 510
10.7 Exercises 513