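Modern Information Retrieval (English edition, 2nd Edition) is a book published by China Machine Press in 2011, written by Ricardo Baeza-Yates (Spain) and Berthier Ribeiro-Neto (Brazil).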
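Basic information
- Title: Modern Information Retrieval (English edition, 2nd Edition)
- Authors: Ricardo Baeza-Yates (Spain), Berthier Ribeiro-Neto (Brazil)
- Original title: Modern Information Retrieval: The Concepts and Technology behind Search (2nd Edition)
- ISBN: 9787111331742
- Pages: 913
- Publisher: China Machine Press
- Publication date: March 2011
- Format: 32mo
- Series: Classic Original Editions (經典原版書庫)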
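Synopsis
Modern Information Retrieval (English edition, 2nd Edition) covers all the major concepts and techniques of information retrieval, along with the field's recent developments, giving readers both a broad overview of modern information retrieval and detailed knowledge of its key topics. The main content was written by Ricardo Baeza-Yates and Berthier Ribeiro-Neto, leading figures in the field; for readers who want to study key areas in more depth, the book also includes state-of-the-art surveys of specialized topics written by other leading researchers.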
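Compared with the previous edition, the content and structure have been extensively revised, updated, and expanded; roughly 60% to 70% of the material is new. The specific updates are as follows:
·New chapters on text classification, web crawling, structured text retrieval, and enterprise search, plus an appendix on open source search engines.
·Fully rewritten coverage of user interfaces, multimedia retrieval, and digital libraries.
·Expanded chapters covering important new advances in information retrieval, such as language models, new evaluation methods, query properties, and cluster-based and distributed information retrieval.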
Table of contents
preface to the second edition v
preface to the first edition vii
authors’ acknowledgements to the second edition viii
authors’ acknowledgements to the first edition x
publishers’ acknowledgements xii
contents xvii
1 introduction 1
1.1 information retrieval 1
1.1.1 early developments 1
1.1.2 information retrieval in libraries and digital libraries 3
1.1.3 ir at the center of the stage 3
1.2 the ir problem 3
1.2.1 the user’s task 4
1.2.2 information versus data retrieval 5
1.3 the ir system 5
1.3.1 software architecture of the ir system 5
1.3.2 the retrieval and ranking processes 7
1.4 the web 8
1.4.1 a brief history 8
1.4.2 the e-publishing era 9
1.4.3 how the web changed search 10
1.4.4 practical issues on the web 12
1.5 organization of the book 12
1.5.1 focus of the book 12
1.5.2 book contents 13
1.6 the book web site: a teaching resource 16
1.7 bibliographic discussion 17
2 user interfaces for search 21
by marti hearst
2.1 introduction 21
2.2 how people search 21
2.2.1 information lookup versus exploratory search 22
2.2.2 classic versus dynamic model of information seeking 23
2.2.3 navigation versus search 24
2.2.4 observations of the search process 24
2.3 search interfaces today 25
2.3.1 getting started 25
2.3.2 query specification 26
2.3.3 query specification interfaces 27
2.3.4 retrieval results display 29
2.3.5 query reformulation 32
2.3.6 organizing search results 35
2.4 visualization in search interfaces 40
2.4.1 visualizing boolean syntax 42
2.4.2 visualizing query terms within retrieval results 43
2.4.3 visualizing relationships among words and documents 47
2.4.4 visualization for text mining 49
2.5 design and evaluation of search interfaces 50
2.6 trends and research issues 54
2.7 bibliographic discussion 54
3 modeling 57
3.1 ir models 57
3.1.1 modeling and ranking 57
3.1.2 characterization of an ir model 58
3.1.3 a taxonomy of ir models 59
3.2 classic information retrieval 61
3.2.1 basic concepts 61
3.2.2 the boolean model 64
3.2.3 term weighting 66
3.2.4 tf-idf weights 68
3.2.5 document length normalization 75
3.2.6 the vector model 77
3.2.7 the probabilistic model 79
3.2.8 brief comparison of classic models 86
3.3 alternative set theoretic models 87
3.3.1 set-based model 87
3.3.2 extended boolean model 92
3.3.3 fuzzy set model 95
3.4 alternative algebraic models 98
3.4.1 generalized vector space model 98
3.4.2 latent semantic indexing model 101
3.4.3 neural network model 102
3.5 alternative probabilistic models 104
3.5.1 bm25 104
3.5.2 language models 107
3.5.3 divergence from randomness 113
3.5.4 bayesian network models 116
3.6 other models 124
3.6.1 the hypertext model 124
3.6.2 web based models 125
3.6.3 structured text retrieval 126
3.6.4 multimedia retrieval 126
3.6.5 enterprise and vertical search 126
3.7 trends and research issues 127
3.8 bibliographic discussion 128
4 retrieval evaluation 131
4.1 introduction 131
4.2 the cranfield paradigm 132
4.2.1 a brief history 132
4.2.2 reference collections 134
4.3 retrieval metrics 134
4.3.1 precision and recall 135
4.3.2 single value summaries: p@n, map, mrr, f 139
4.3.3 user-oriented measures 144
4.3.4 dcg: discounted cumulated gain 145
4.3.5 bpref: binary preferences 150
4.3.6 rank correlation metrics 153
4.4 reference collections 158
4.4.1 the trec collections 159
4.4.2 other reference collections 166
4.4.3 other small test collections 167
4.5 user-based evaluation 168
4.5.1 human experimentation in the lab 168
4.5.2 side-by-side panels 168
4.5.3 a/b testing 169
4.5.4 crowdsourcing 170
4.5.5 evaluation using clickthrough data 171
4.6 practical caveats 173
4.7 trends and research issues 174
4.8 bibliographic discussion 174
5 relevance feedback and query expansion 177
5.1 introduction 177
5.2 a framework for feedback methods 178
5.3 explicit relevance feedback 180
5.3.1 relevance feedback for the vector model: rocchio method 181
5.3.2 relevance feedback for the probabilistic model 183
5.3.3 evaluation of relevance feedback 184
5.4 explicit feedback through clicks 185
5.4.1 eye tracking and relevance judgements 185
5.4.2 user behavior 186
5.4.3 clicks as a metric of user preferences 187
5.5 implicit feedback through local analysis 190
5.5.1 implicit feedback through local clustering 190
5.5.2 implicit feedback through local context analysis 193
5.6 implicit feedback through global analysis 195
5.6.1 query expansion based on a similarity thesaurus 195
5.6.2 query expansion based on a statistical thesaurus 198
5.7 trends and research issues 200
5.8 bibliographic discussion 200
6 documents: languages & properties 203
with gonzalo navarro and nivio ziviani
6.1 introduction 203
6.2 metadata 205
6.3 document formats 206
6.3.1 text 206
6.3.2 multimedia 207
6.3.3 graphics and virtual reality 208
6.4 markup languages 208
6.4.1 sgml 209
6.4.2 html 211
6.4.3 xml 214
6.4.4 rdf: resource description framework 216
6.4.5 hytime 217
6.5 text properties 218
6.5.1 information theory 218
6.5.2 modeling natural language 219
6.5.3 text similarity 222
6.6 document preprocessing 223
6.6.1 lexical analysis of the text 224
6.6.2 elimination of stopwords 226
6.6.3 stemming 226
6.6.4 keyword selection 227
6.6.5 thesauri 228
6.7 organizing documents 231
6.7.1 taxonomies 231
6.7.2 folksonomies 232
6.8 text compression 233
6.8.1 basic concepts 234
6.8.2 statistical methods 234
6.8.3 statistical methods: modeling 235
6.8.4 statistical methods: coding 238
6.8.5 dictionary methods 245
6.8.6 preprocessing for compression 246
6.8.7 comparing text compression techniques 248
6.8.8 structured text compression 249
6.9 trends and research issues 250
6.10 bibliographical discussion 253
7 queries: languages & properties 255
with gonzalo navarro
7.1 query languages 255
7.1.1 keyword-based querying 256
7.1.2 beyond keywords 259
7.1.3 structural queries 262
7.1.4 query protocols 265
7.2 query properties 267
7.2.1 characterizing web queries 267
7.2.2 user search behavior 269
7.2.3 query intent 270
7.2.4 query topic 272
7.2.5 query sessions and missions 273
7.2.6 query difficulty 274
7.3 trends and research issues 278
7.4 bibliographical discussion 279
8 text classification 281
with marcos gonçalves
8.1 introduction 281
8.2 a characterization of text classification 282
8.2.1 machine learning 282
8.2.2 the text classification problem 283
8.2.3 text classification algorithms 284
8.3 unsupervised algorithms 286
8.3.1 clustering 286
8.3.2 naive text classification 290
8.4 supervised algorithms 291
8.4.1 decision trees 294
8.4.2 the k-nn classifier 299
8.4.3 the rocchio classifier 300
8.4.4 probabilistic naive bayes document classification 303
8.4.5 the svm classifier 306
8.4.6 ensemble classifiers 316
8.4.7 final remarks on supervised algorithms 319
8.5 feature selection or dimensionality reduction 320
8.5.1 term–class incidence table 321
8.5.2 term document frequency 322
8.5.3 tf-idf weights 322
8.5.4 mutual information 323
8.5.5 information gain 323
8.5.6 chi square 324
8.5.7 impact of feature selection 325
8.6 evaluation metrics 325
8.6.1 contingency table 325
8.6.2 accuracy and error 326
8.6.3 precision and recall 327
8.6.4 f-measure and f1 327
8.6.5 cross-validation 329
8.6.6 standard collections 329
8.7 organizing the classes – building taxonomies 330
8.8 trends and research issues 333
8.9 bibliographic discussion 334
9 indexing and searching 337
with gonzalo navarro
9.1 introduction 337
9.2 inverted indexes 340
9.2.1 basic concepts 340
9.2.2 full inverted indexes 341
9.2.3 searching 345
9.2.4 ranking 348
9.2.5 construction 351
9.2.6 compressed inverted indexes 354
9.2.7 structural queries 357
9.3 signature files 357
9.4 suffix trees and suffix arrays 360
9.4.1 structure: tries and suffix trees 361
9.4.2 searching for simple strings 362
9.4.3 searching for complex patterns 363
9.4.4 construction 365
9.4.5 compressed suffix arrays 367
9.5 sequential searching 372
9.5.1 simple strings: horspool 373
9.5.2 complex patterns: automata and bit-parallelism 375
9.5.3 faster bit-parallel algorithms 379
9.5.4 regular expressions 382
9.5.5 multiple patterns 384
9.5.6 approximate searching 385
9.5.7 searching compressed text 389
9.6 multi-dimensional indexing 391
9.7 trends and research issues 393
9.8 bibliographic discussion 394
10 parallel and distributed ir 399
with eric brown
10.1 introduction 399
10.2 a taxonomy of distributed ir systems 402
10.3 data partitioning 404
10.3.1 collection partitioning 405
10.3.2 collection selection 407
10.3.3 inverted index partitioning 409
10.3.4 partitioning other indexes 413
10.4 parallel ir 414
10.4.1 introduction 414
10.4.2 parallel ir on mimd architectures 416
10.4.3 parallel ir on simd architectures 418
10.5 cluster-based ir 423
10.6 distributed ir 424
10.6.1 introduction 424
10.6.2 indexing 428
10.6.3 query processing 431
10.6.4 web issues 437
10.7 federated search 438
10.8 retrieval in peer-to-peer networks 440
10.9 trends and research issues 444
10.10 bibliographic discussion 445
11 web retrieval 447
with yoelle maarek
11.1 introduction 447
11.2 a challenging problem 449
11.3 the web 451
11.3.1 characteristics 451
11.3.2 structure of the web graph 452
11.3.3 modeling the web 454
11.3.4 link analysis 456
11.4 search engine architectures 458
11.4.1 basic architecture 458
11.4.2 cluster-based architecture 459
11.4.3 caching 462
11.4.4 multiple indexes 464
11.4.5 distributed architectures 466
11.5 search engine ranking 468
11.5.1 ranking signals 469
11.5.2 link-based ranking 470
11.5.3 simple ranking functions 473
11.5.4 learning to rank 473
11.5.5 learning the ranking function 474
11.5.6 quality evaluation 475
11.5.7 web spam 476
11.6 managing web data 477
11.6.1 assigning identifiers to documents 477
11.6.2 metadata 478
11.6.3 compressing the web graph 478
11.6.4 handling duplicated data 479
11.7 search engine user interaction 480
11.7.1 the search rectangle paradigm 481
11.7.2 the search engine result page 488
11.7.3 educating the user 497
11.8 browsing 498
11.8.1 flat browsing 499
11.8.2 structure guided browsing and web directories 499
11.9 beyond browsing 501
11.9.1 hypertext and the web 501
11.9.2 combining searching with browsing 501
11.9.3 web query languages 503
11.9.4 dynamic search 503
11.10 related problems 504
11.10.1 computational advertising 504
11.10.2 web mining 506
11.10.3 metasearch 508
11.11 trends and research issues 509
11.11.1 beyond static text data 509
11.11.2 current challenges 511
11.12 bibliographical discussion 513
12 web crawling 515
with carlos castillo
12.1 introduction 515
12.2 applications of a web crawler 517
12.2.1 general web search 517
12.2.2 topical crawling 518
12.2.3 web characterization 518
12.2.4 mirroring 518
12.2.5 web site analysis 519
12.3 a taxonomy of crawlers 519
12.3.1 types of web pages 520
12.4 architecture and implementation 521
12.4.1 crawler architecture 521
12.4.2 practical issues 523
12.4.3 parallel crawling 526
12.5 scheduling algorithms 527
12.5.1 selection policy 528
12.5.2 revisit policy 530
12.5.3 politeness policy 535
12.5.4 combining policies 538
12.6 evaluation 539
12.6.1 evaluating network usage 539
12.6.2 evaluating long-term scheduling 540
12.7 trends and research issues 541
12.7.1 crawling the “hidden” web 541
12.7.2 crawling with the help of web sites 542
12.7.3 distributed crawling 543
12.8 bibliographic discussion 543
13 structured text retrieval 545
with mounia lalmas
13.1 introduction 545
13.2 structuring power 546
13.2.1 explicit vs. implicit structure 546
13.2.2 static vs. dynamic structure 547
13.2.3 single hierarchy vs. multiple hierarchies 548
13.3 early text retrieval models 549
13.3.1 model based on non-overlapping lists 549
13.3.2 model based on proximal nodes 550
13.3.3 ranking structured text results 551
13.4 xml retrieval 551
13.4.1 challenges in xml retrieval 551
13.4.2 indexing strategies 553
13.4.3 ranking strategies 554
13.4.4 removing overlaps 565
13.5 xml retrieval evaluation 566
13.5.1 document collections 566
13.5.2 topics 567
13.5.3 retrieval tasks 568
13.5.4 relevance 569
13.5.5 measures 571
13.6 query languages 573
13.6.1 characteristics 574
13.6.2 classification of xml query languages 575
13.6.3 examples of xml query languages 577
13.7 trends and research issues 582
13.8 bibliographic discussion 585
14 multimedia information retrieval 587
by dulce ponceleón and malcolm slaney
14.1 introduction 587
14.1.1 what is multimedia? 587
14.1.2 multimedia ir 588
14.1.3 text ir versus multimedia ir 589
14.2 the challenges 589
14.2.1 the semantic gap 589
14.2.2 feature ambiguity 591
14.2.3 machine-generated data 591
14.3 content-based image retrieval 592
14.3.1 color-based retrieval 593
14.3.2 texture 593
14.3.3 salient points 596
14.4 audio and music retrieval 597
14.4.1 fingerprinting 598
14.4.2 speech recognition 599
14.4.3 speaker identification 601
14.4.4 spoken document retrieval 602
14.4.5 audio basics 602
14.5 retrieving and browsing video 606
14.5.1 video abstracts 606
14.5.2 static summaries 607
14.5.3 mosaics and salient stills 608
14.5.4 dynamic summaries 609
14.5.5 interactive summaries 611
14.5.6 visual vs. audio browsing 612
14.5.7 evaluating summaries 613
14.6 fusion models: combining it all 614
14.6.1 naming faces 614
14.6.2 naming images 615
14.6.3 naming audio 616
14.6.4 combining audio and video for avsr 617
14.6.5 combining audio and video for multimedia 620
14.7 segmentation 620
14.7.1 a video segmentation example 620
14.7.2 segmentation schemes for video 622
14.7.3 video segmentation with edges 623
14.7.4 speech segmentation 624
14.7.5 segmentation evaluation 625
14.8 compression and mpeg standards 625
14.8.1 intensity and sampling 626
14.8.2 color 626
14.8.3 lossy compression 628
14.8.4 lossless compression 628
14.8.5 temporal redundancy 630
14.8.6 motion prediction 631
14.8.7 mpeg standards 633
14.9 trends and research issues 636
14.10 bibliographic discussion 637
15 enterprise search 641
by david hawking
15.1 introduction 641
15.1.1 characteristics and applications of enterprise search 642
15.1.2 enterprise search software 643
15.1.3 workplace search 644
15.2 enterprise search tasks 644
15.2.1 examples of search-supported tasks 644
15.2.2 search types 647
15.2.3 studying enterprise search 647
15.3 architecture of enterprise search systems 648
15.3.1 gathering 648
15.3.2 extracting 651
15.3.3 indexing 652
15.3.4 indexing textual annotations 653
15.3.5 query processing 654
15.3.6 presentation of search results 655
15.3.7 security models 657
15.3.8 federation/metasearch 659
15.4 enterprise search evaluation 662
15.4.1 published test collections for enterprise search 662
15.4.2 internal enterprise search evaluations 663
15.4.3 enterprise search tuning 665
15.4.4 what is it reasonable to expect? 666
15.5 potential reasons for dissatisfaction 667
15.6 context and personalization 668
15.6.1 controls and levers for contextualization 671
15.6.2 contextualization: local, enterprise or global? 675
15.6.3 privacy of profiles 676
15.6.4 defining, creating and maintaining a profile 677
15.6.5 user modeling 677
15.6.6 implicit measures 679
15.6.7 information filtering 679
15.6.8 social recommender systems 680
15.7 trends and research issues 681
15.8 bibliographic discussion 681
16 library systems 685
by edie rasmussen
16.1 the information environment in the library 685
16.2 online public access catalogues 687
16.2.1 opacs and bibliographic records 689
16.2.2 information retrieval from the ils 691
16.2.3 integrating the hybrid library 693
16.2.4 opacs and end users 694
16.2.5 ils: vendors and products 695
16.3 ir systems and document databases 697
16.3.1 bibliographic and full-text databases 698
16.3.2 content of database records 698
16.3.3 the online industry: database vendors 701
16.3.4 information retrieval from document databases 702
16.4 information retrieval in organizations 706
16.5 trends and research issues 708
16.6 bibliographic discussion 709
17 digital libraries 711
by marcos gonçalves
17.1 introduction 711
17.2 defining digital libraries 712
17.3 a general architecture 713
17.4 fundamentals 714
17.4.1 digital objects and collections 714
17.4.2 metadata and catalogs 716
17.4.3 repositories/archives 719
17.4.4 services 723
17.5 social-economical issues 725
17.5.1 social issues 725
17.5.2 economical issues 726
17.6 software systems 727
17.6.1 greenstone 728
17.6.2 eprints 728
17.6.3 dspace 728
17.6.4 fedora 729
17.6.5 open digital libraries 729
17.6.6 the 5s suite 730
17.7 dl case studies 731
17.7.1 the networked dl of theses and dissertations 731
17.7.2 the national science digital library 732
17.7.3 the etana-dl archaeological digital library 732
17.8 trends and research issues 733
17.8.1 evaluation 733
17.8.2 integration 733
17.8.3 other research challenges 734
17.9 bibliographic discussion 735
a open source search engines 737
with christian middleton
a.1 introduction 737
a.2 search engines 738
a.2.1 preliminary selection of search engines 738
a.2.2 features 741
a.2.3 evaluation 742
a.3 methodology 743
a.3.1 document collections 743
a.3.2 evaluation tests 744
a.3.3 experimental setup 744
a.4 experimental results 745
a.4.1 test a – indexing 745
a.4.2 test b – incremental indexing 749
a.4.3 test c – search performance 749
a.4.4 global evaluation 752
a.5 conclusions 753
b biographies 755
references 761
index 893