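Editor's Recommendation
This book is a classic in the field of Web mining and search engines. It has been well received since publication and has been adopted as a textbook by leading universities including Stanford, Princeton, and Carnegie Mellon. It first introduces foundational topics such as Web crawling and search, then builds on them to explain in depth the machine learning techniques involved in solving the various problems of Web mining. It presents many applications of machine learning to how systems acquire, store, and analyze data, and discusses the strengths, weaknesses, and prospects of those applications. The analysis is thorough and forward-looking throughout, laying a theoretical and practical foundation for building innovative Web mining applications. The book is suitable for researchers, faculty, and students in information retrieval and machine learning, and is also an excellent reference for Web developers.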
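“This book reveals in depth the inner technical workings of search engines. With it, you could even build a search engine of your own.”
-- searchenginewatch website
“This book is systematic, comprehensive, and deep, and Web developers at large will be able to understand and master its contents. The author is one of the leading figures in this research area, with extensive knowledge of and original insight into hypertext information mining and retrieval.”
“The author has brought together all the important work in this field into this masterpiece and presents what was originally very difficult material in an accessible way. With this book, Web mining can finally become a university course.”
-- Jaideep Srivastava, Professor, University of Minnesota, IEEE Fellow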
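Content Summary
《Web數據挖掘》 (Web Data Mining) is a reference for professionals engaged in data mining research and development, and is also suitable as a textbook for graduate students in computer science and related programs. The book first covers the foundations of the Web (including Web crawling mechanisms, Web indexing mechanisms, and keyword-based or similarity-based search mechanisms). It then systematically describes the fundamentals of Web mining, focusing on hypertext-based machine learning and data mining methods such as clustering, collaborative filtering, supervised learning, and semi-supervised learning. Finally, it describes how these basic principles are applied in Web mining. The book gives readers a solid technical background and up-to-date knowledge.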
Table of Contents
1 INTRODUCTION
1.1 Crawling and Indexing
1.2 Topic Directories
1.3 Clustering and Classification
1.4 Hyperlink Analysis
1.5 Resource Discovery and Vertical Portals
1.6 Structured vs. Unstructured Data Mining
1.7 Bibliographic Notes
PART Ⅰ INFRASTRUCTURE
2 CRAWLING THE WEB
2.1 HTML and HTTP Basics
2.2 Crawling Basics
2.3 Engineering Large-Scale Crawlers
2.3.1 DNS Caching, Prefetching, and Resolution
2.3.2 Multiple Concurrent Fetches
2.3.3 Link Extraction and Normalization
2.3.4 Robot Exclusion
2.3.5 Eliminating Already-Visited URLs
2.3.6 Spider Traps
2.3.7 Avoiding Repeated Expansion of Links on Duplicate Pages
2.3.8 Load Monitor and Manager
2.3.9 Per-Server Work-Queues
2.3.10 Text Repository
2.3.11 Refreshing Crawled Pages
2.4 Putting Together a Crawler
2.4.1 Design of the Core Components
2.4.2 Case Study: Using w3c-libwww
2.5 Bibliographic Notes
3 WEB SEARCH AND INFORMATION RETRIEVAL
3.1 Boolean Queries and the Inverted Index
3.1.1 Stopwords and Stemming
3.1.2 Batch Indexing and Updates
3.1.3 Index Compression Techniques
3.2 Relevance Ranking
3.2.1 Recall and Precision
3.2.2 The Vector-Space Model
3.2.3 Relevance Feedback and Rocchio's Method
3.2.4 Probabilistic Relevance Feedback Models
3.2.5 Advanced Issues
3.3 Similarity Search
3.3.1 Handling "Find-Similar" Queries
3.3.2 Eliminating Near Duplicates via Shingling
3.3.3 Detecting Locally Similar Subgraphs of the Web
3.4 Bibliographic Notes
PART Ⅱ LEARNING
4 SIMILARITY AND CLUSTERING
4.1 Formulations and Approaches
4.1.1 Partitioning Approaches
4.1.2 Geometric Embedding Approaches
4.1.3 Generative Models and Probabilistic Approaches
4.2 Bottom-Up and Top-Down Partitioning Paradigms
4.2.1 Agglomerative Clustering
4.3 Clustering and Visualization via Embeddings
4.3.1 Self-Organizing Maps (SOMs)
4.3.2 Multidimensional Scaling (MDS) and FastMap
4.3.3 Projections and Subspaces
4.3.4 Latent Semantic Indexing (LSI)
4.4 Probabilistic Approaches to Clustering
4.4.1 Generative Distributions for Documents
4.4.2 Mixture Models and Expectation Maximization (EM)
4.4.3 Multiple Cause Mixture Model (MCMM)
4.4.4 Aspect Models and Probabilistic LSI
4.4.5 Model and Feature Selection
4.5 Collaborative Filtering
4.5.1 Probabilistic Models
4.5.2 Combining Content-Based and Collaborative Features
4.6 Bibliographic Notes
5 SUPERVISED LEARNING
5.1 The Supervised Learning Scenario
5.2 Overview of Classification Strategies
5.3 Evaluating Text Classifiers
5.3.1 Benchmarks
5.3.2 Measures of Accuracy
5.4 Nearest Neighbor Learners
5.4.1 Pros and Cons
5.4.2 Is TFIDF Appropriate?
5.5 Feature Selection
5.5.1 Greedy Inclusion Algorithms
5.5.2 Truncation Algorithms
5.5.3 Comparison and Discussion
5.6 Bayesian Learners
5.6.1 Naive Bayes Learners
5.6.2 Small-Degree Bayesian Networks
5.7 Exploiting Hierarchy among Topics
5.7.1 Feature Selection
5.7.2 Enhanced Parameter Estimation
5.7.3 Training and Search Strategies
5.8 Maximum Entropy Learners
5.9 Discriminative Classification
5.9.1 Linear Least-Square Regression
5.9.2 Support Vector Machines
5.10 Hypertext Classification
5.10.1 Representing Hypertext for Supervised Learning
5.10.2 Rule Induction
5.11 Bibliographic Notes
6 SEMISUPERVISED LEARNING
6.1 Expectation Maximization
6.1.1 Experimental Results
6.1.2 Reducing the Belief in Unlabeled Documents
6.1.3 Modeling Labels Using Many Mixture Components
……
PART Ⅲ APPLICATIONS
……
Preface
This book is about finding significant statistical patterns relating hypertext documents, topics, hyperlinks, and queries and using these patterns to connect users to information they seek. The Web has become a vast storehouse of knowledge, built in a decentralized yet collaborative manner. It is a living, growing, populist, and participatory medium of expression with no central editorship. This has positive and negative implications. On the positive side, there is widespread participation in authoring content. Compared to print or broadcast media, the ratio of content creators to the audience is more equitable. On the negative side, the heterogeneity and lack of structure makes it hard to frame queries and satisfy information needs. For many queries posed with the help of words and phrases, there are thousands of apparently relevant responses, but on closer inspection these turn out to be disappointing for all but the simplest queries. Queries involving nouns and noun phrases, where the information need is to find out about the named entity, are the simplest sort of information-hunting tasks. Only sophisticated users succeed with more complex queries, for instance those that involve articles and prepositions to relate named objects, actions, and agents. If you are a regular seeker and user of Web information, this state of affairs needs no further description.
Detecting and exploiting statistical dependencies between terms, Web pages, and hyperlinks will be the central theme in this book. Such dependencies are also called patterns, and the act of searching for such patterns is called machine learning, or data mining. Here are some examples of machine learning for Web applications. Given a crawl of a substantial portion of the Web, we may be interested in constructing a topic directory like Yahoo!, perhaps detecting the emergence and decline of prominent topics with passing time. Once a topic directory is available, we may wish to assign freshly crawled pages and sites to suitable positions in the directory.
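About the Author
Soumen Chakrabarti is a well-known expert in Web search and mining and an associate editor of ACM Transactions on the Web. He holds a PhD from the University of California, Berkeley, and is an associate professor in the Department of Computer Science and Engineering at the Indian Institute of Technology. He previously worked at the IBM Almaden Research Center on hypertext databases and data mining. He has extensive hands-on project experience, has developed several Web mining systems, and holds a number of US patents.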