Introduction to Data Mining (English Edition) offers a comprehensive introduction to the theory and methods of data mining, emphasizing how data mining techniques can be applied to a wide range of practical problems across many disciplines. The book covers five topics: data, classification, association analysis, clustering, and anomaly detection. Except for anomaly detection, each topic is covered in two chapters: the first presents basic concepts, representative algorithms, and evaluation techniques, while the second treats advanced concepts and algorithms in greater depth. The aim is to give readers a thorough grounding in the fundamentals of data mining while also introducing important advanced topics.
- Contains numerous figures and tables, comprehensive examples, and a rich set of exercises.
- Requires no database background and only minimal knowledge of statistics or mathematics.
- Offers ample supplementary materials online, including lecture slides (PPT), solutions to exercises, and data sets.
Basic Information
- Title: Classic Original Books Series: Introduction to Data Mining
- Authors: Pang-Ning Tan, Michael Steinbach, Vipin Kumar
- Publisher: China Machine Press
- Pages: 769
- Format: 32mo
- List price: CNY 59.00
- Original title: Introduction to Data Mining
- Category: Science and technology
- Publication date: September 1, 2010
- Language: English
- ISBN: 9787111316701, 7111316703
- Brand: China Machine Press
Content Summary
Introduction to Data Mining (English Edition) is a volume in the Classic Original Books series.
About the Authors
Authors: Pang-Ning Tan (USA), Michael Steinbach (USA), Vipin Kumar (USA)
Pang-Ning Tan is an assistant professor in the Department of Computer Science and Engineering at Michigan State University, where he teaches courses on data mining and database systems. His research focuses on developing data mining algorithms suited to a broad range of applications, including medical informatics, the earth sciences, social networks, Web mining, and computer security.
Michael Steinbach holds a B.S. in mathematics, an M.S. in statistics, and a Ph.D. in computer science, all from the University of Minnesota, and is a research associate in the Department of Computer Science and Engineering at the University of Minnesota, Twin Cities.
Vipin Kumar is William Norris Professor and head of the Department of Computer Science and Engineering at the University of Minnesota. From 1998 to 2005, he served as director of the Army High Performance Computing Research Center.
Table of Contents
Preface
1 Introduction
1.1 What Is Data Mining?
1.2 Motivating Challenges
1.3 The Origins of Data Mining
1.4 Data Mining Tasks
1.5 Scope and Organization of the Book
1.6 Bibliographic Notes
1.7 Exercises
2 Data
2.1 Types of Data
2.1.1 Attributes and Measurement
2.1.2 Types of Data Sets
2.2 Data Quality
2.2.1 Measurement and Data Collection Issues
2.2.2 Issues Related to Applications
2.3 Data Preprocessing
2.3.1 Aggregation
2.3.2 Sampling
2.3.3 Dimensionality Reduction
2.3.4 Feature Subset Selection
2.3.5 Feature Creation
2.3.6 Discretization and Binarization
2.3.7 Variable Transformation
2.4 Measures of Similarity and Dissimilarity
2.4.1 Basics
2.4.2 Similarity and Dissimilarity between Simple Attributes
2.4.3 Dissimilarities between Data Objects
2.4.4 Similarities between Data Objects
2.4.5 Examples of Proximity Measures
2.4.6 Issues in Proximity Calculation
2.4.7 Selecting the Right Proximity Measure
2.5 Bibliographic Notes
2.6 Exercises
3 Exploring Data
3.1 The Iris Data Set
3.2 Summary Statistics
3.2.1 Frequencies and the Mode
3.2.2 Percentiles
3.2.3 Measures of Location: Mean and Median
3.2.4 Measures of Spread: Range and Variance
3.2.5 Multivariate Summary Statistics
3.2.6 Other Ways to Summarize the Data
3.3 Visualization
3.3.1 Motivations for Visualization
3.3.2 General Concepts
3.3.3 Techniques
3.3.4 Visualizing Higher-Dimensional Data
3.3.5 Do's and Don'ts
3.4 OLAP and Multidimensional Data Analysis
3.4.1 Representing Iris Data as a Multidimensional Array
3.4.2 Multidimensional Data: The General Case
3.4.3 Analyzing Multidimensional Data
3.4.4 Final Comments on Multidimensional Data Analysis
3.5 Bibliographic Notes
3.6 Exercises
4 Classification: Basic Concepts, Decision Trees, and Model Evaluation
4.1 Preliminaries
4.2 General Approach to Solving a Classification Problem
4.3 Decision Tree Induction
4.3.1 How a Decision Tree Works
4.3.2 How to Build a Decision Tree
4.3.3 Methods for Expressing Attribute Test Conditions
4.3.4 Measures for Selecting the Best Split
4.3.5 Algorithm for Decision Tree Induction
4.3.6 An Example: Web Robot Detection
4.3.7 Characteristics of Decision Tree Induction
4.4 Model Overfitting
4.4.1 Overfitting Due to Presence of Noise
4.4.2 Overfitting Due to Lack of Representative Samples
4.4.3 Overfitting and the Multiple Comparison Procedure
4.4.4 Estimation of Generalization Errors
4.4.5 Handling Overfitting in Decision Tree Induction
4.5 Evaluating the Performance of a Classifier
4.5.1 Holdout Method
4.5.2 Random Subsampling
4.5.3 Cross-Validation
4.5.4 Bootstrap
4.6 Methods for Comparing Classifiers
4.6.1 Estimating a Confidence Interval for Accuracy
4.6.2 Comparing the Performance of Two Models
4.6.3 Comparing the Performance of Two Classifiers
4.7 Bibliographic Notes
4.8 Exercises
5 Classification: Alternative Techniques
5.1 Rule-Based Classifier
5.1.1 How a Rule-Based Classifier Works
5.1.2 Rule-Ordering Schemes
5.1.3 How to Build a Rule-Based Classifier
5.1.4 Direct Methods for Rule Extraction
5.1.5 Indirect Methods for Rule Extraction
5.1.6 Characteristics of Rule-Based Classifiers
5.2 Nearest-Neighbor Classifiers
5.2.1 Algorithm
5.2.2 Characteristics of Nearest-Neighbor Classifiers
5.3 Bayesian Classifiers
5.3.1 Bayes Theorem
5.3.2 Using the Bayes Theorem for Classification
5.3.3 Naive Bayes Classifier
5.3.4 Bayes Error Rate
5.3.5 Bayesian Belief Networks
5.4 Artificial Neural Network (ANN)
5.4.1 Perceptron
5.4.2 Multilayer Artificial Neural Network
5.4.3 Characteristics of ANN
5.5 Support Vector Machine (SVM)
5.5.1 Maximum Margin Hyperplanes
5.5.2 Linear SVM: Separable Case
5.5.3 Linear SVM: Nonseparable Case
5.5.4 Nonlinear SVM
5.5.5 Characteristics of SVM
5.6 Ensemble Methods
5.6.1 Rationale for Ensemble Method
5.6.2 Methods for Constructing an Ensemble Classifier
5.6.3 Bias-Variance Decomposition
5.6.4 Bagging
5.6.5 Boosting
5.6.6 Random Forests
5.6.7 Empirical Comparison among Ensemble Methods
5.7 Class Imbalance Problem
5.7.1 Alternative Metrics
5.7.2 The Receiver Operating Characteristic Curve
5.7.3 Cost-Sensitive Learning
5.7.4 Sampling-Based Approaches
5.8 Multiclass Problem
5.9 Bibliographic Notes
5.10 Exercises
6 Association Analysis: Basic Concepts and Algorithms
6.1 Problem Definition
6.2 Frequent Itemset Generation
6.2.1 The Apriori Principle
6.2.2 Frequent Itemset Generation in the Apriori Algorithm
6.2.3 Candidate Generation and Pruning
6.2.4 Support Counting
6.2.5 Computational Complexity
6.3 Rule Generation
6.3.1 Confidence-Based Pruning
6.3.2 Rule Generation in Apriori Algorithm
6.3.3 An Example: Congressional Voting Records
6.4 Compact Representation of Frequent Itemsets
6.4.1 Maximal Frequent Itemsets
6.4.2 Closed Frequent Itemsets
6.5 Alternative Methods for Generating Frequent Itemsets
6.6 FP-Growth Algorithm
……
Preface
Advances in data generation and collection are producing data sets of massive size in commerce and a variety of scientific disciplines. Data warehouses store details of the sales and operations of businesses, Earth-orbiting satellites beam high-resolution images and sensor data back to Earth, and genomics experiments generate sequence, structural, and functional data for an increasing number of organisms.
The ease with which data can now be gathered and stored has created a new attitude toward data analysis: gather whatever data you can whenever and wherever possible. It has become an article of faith that the gathered data will have value, either for the purpose that initially motivated its collection or for purposes not yet envisioned.
The field of data mining grew out of the limitations of current data analysis techniques in handling the challenges posed by these new types of data sets. Data mining does not replace other areas of data analysis, but rather takes them as the foundation for much of its work. While some areas of data mining, such as association analysis, are unique to the field, other areas, such as clustering, classification, and anomaly detection, build upon a long history of work on these topics in other fields. Indeed, the willingness of data mining researchers to draw upon existing techniques has contributed to the strength and breadth of the field, as well as to its rapid growth.