Research on Key Technologies of Vertical Search Engines for Time-Sensitive Objects
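Abstract
As search services become more widespread and more deeply embedded in daily use, users' search needs in specific domains are becoming clearer, and their demands for personalized and real-time results are rising. As a result, efficient information retrieval within vertical domains has become a focus of the search engine market. Using focused crawling, intelligent scheduling, and high-dimensional indexing, and drawing on domain knowledge and users' search habits, vertical search engines provide results within a specific vertical domain that are fresher, more personalized, and more specialized.
     However, most existing vertical search engines suffer from the following problems: (1) the crawler operates passively, so the delay between crawling a target and the user's query is too long; (2) crawl scheduling is blind, so crawling resources are poorly utilized; and (3) the indexing system performs poorly and lacks effective algorithms for feature extraction and clustering of certain kinds of text. These problems have seriously constrained the healthy development of the vertical search engine market. This work attempts a systematic study of these problems and their key technologies. Its main contributions and innovations are as follows:
     1. Active focused crawling for the crawler system
     To address passive crawling and the long delay between target crawling and user queries, this work proposes a semantics-based, query-driven focused crawling technique. It interprets user queries using domain knowledge, translates each query semantically into target web pages, and crawls actively in response to user queries, which resolves the delay. Extensive experiments and an initial deployment in a real project show that query-driven focused crawling delivers search results at the 10-second level, greatly reducing latency and improving user experience.
     2. Intelligent scheduling for the crawler system
     To address blind crawl scheduling and low utilization, this work models changes to web documents as a Poisson process. Building on a quantitative estimate of each individual object's freshness, it proposes PoissonRank, an object-level, fine-grained resource scheduling algorithm that schedules crawls according to change and greatly improves the utilization of crawling resources. Simulation analysis and application in a commercial project demonstrate the model's effectiveness: the scheduling technique raises crawl resource utilization and captures object changes more effectively. Extensive experiments in real environments confirm the object distribution patterns, the validity of the Poisson process model, and the improvement in user experience. PoissonRank also adds very little overhead to the system and is highly scalable.
     3. Online updating of high-dimensional indexes in the indexing system
     To address the low efficiency of online updates to high-dimensional multimedia indexes, this work optimizes the LSH algorithm and proposes CB-LSH, a high-dimensional indexing technique based on compressed bitmaps. By recasting the LSH operators in Boolean algebra and then introducing a compressed bitmap index, it improves LSH insertion, deletion, and modification performance across the board and solves the performance problem of online updates. Theoretical analysis proves CB-LSH's improvements in space usage and time complexity. Extensive experiments on real data show that, compared with existing LSH algorithms, CB-LSH uses one third less memory, improves deletion performance by nearly an order of magnitude, improves query performance several-fold, and improves insertion performance by about half. A real project confirms that CB-LSH is effective and feasible in a retrieval system for massive collections of multimedia objects with real-time online updates.
     4. Result merging for text data in the indexing system
     Text in vertical domains is short, highly specialized, and noisy, so clustering in the indexing system performs poorly. To address this, this work proposes TrigSigs, a text clustering technique based on natural-language trigger pairs. Using first-order trigger pairs, it mines the associations among the implicit attributes of words, learns domain-specific vocabulary, removes noise words, and extracts key feature words, yielding a fine-grained, object-level clustering technique. Simulation experiments show that the algorithm filters out most noise words and assigns weights according to each word's discriminative power, substantially improving the accuracy of the final clustering.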
As search engine services grow more popular, domain-specific search requests are becoming clearer, and the requirements for personalized and recency-sensitive search are rising. As a result, efficient information retrieval based on vertical search engines has become a hot topic in the search engine field. Using focused crawling, intelligent scheduling, and high-dimensional indexing, together with domain knowledge and user preferences, vertical search engines provide more up-to-date, more personalized, and more professional search results.
     However, most vertical search engines have the following major problems: (1) the passive crawling mode of the crawler system causes a long delay between user query and result retrieval; (2) the scheduler of the crawler system schedules web page crawling blindly, which leads to very low utilization of crawling resources; (3) the indexing system does not perform well enough for online updates, and the merged results for certain unstructured text objects are poor. This paper studies these problems and the related key technologies in depth.
     The major contributions of the paper are as follows:
     First, it proposes a semantics-based query triggered crawling (QTC) technique to address the long delay between user query and result retrieval caused by passive crawlers. Based on domain knowledge, QTC translates a user query into request parameters for potential target results on domain web sites, and crawls actively for the current user queries. Extensive experiments and beta tests in real commercial applications show that QTC closes the delay gap between user query and result retrieval, and brings 10-second-level freshness to vertical search results.
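The query-to-request translation described for QTC can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration, not taken from the abstract: the flight-search domain, the `CITY_CODES` table standing in for domain knowledge, the two example sites, and their parameter names are all hypothetical.

```python
from urllib.parse import urlencode

# Hypothetical domain knowledge: surface forms -> canonical airport codes.
CITY_CODES = {"hangzhou": "HGH", "beijing": "PEK", "shanghai": "SHA"}

# Hypothetical target sites and the parameter name each one uses per slot.
SITE_TEMPLATES = {
    "https://flights.example.com/search": {"origin": "from", "dest": "to", "date": "d"},
    "https://tickets.example.org/query": {"origin": "dep", "dest": "arr", "date": "day"},
}


def parse_query(query):
    """Map a query such as 'hangzhou beijing 2010-05-01' to semantic slots."""
    origin, dest, date = query.lower().split()
    return {"origin": CITY_CODES[origin], "dest": CITY_CODES[dest], "date": date}


def build_requests(query):
    """Translate one user query into one crawl URL per target site."""
    slots = parse_query(query)
    urls = []
    for base, names in SITE_TEMPLATES.items():
        params = {names[slot]: value for slot, value in slots.items()}
        urls.append(base + "?" + urlencode(params))
    return urls


for url in build_requests("hangzhou beijing 2010-05-01"):
    print(url)
```

A crawler driven this way fetches only the pages that can answer the current query, which is the sense in which the crawl is triggered by the query rather than run on a fixed cycle.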
     Second, it proposes an object-level, change-aware resource scheduling technique to address the low utilization of crawling resources caused by blind crawling. This technique, named PoissonRank, uses a Poisson process to model the sequence of times at which a web object changes. The Poisson process model provides a quantitative estimate of object-level freshness. By scheduling crawler resources according to the estimated object freshness, the technique not only improves resource utilization but also captures the change patterns of objects more accurately. Extensive experiments on real data show that the Poisson process model estimates object freshness accurately and that resource utilization improves at nearly zero extra performance cost.
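The idea of scheduling by estimated freshness can be sketched as follows. This is not the PoissonRank algorithm itself, whose details the abstract does not give. It only uses the standard Poisson-process result that an object changing at rate λ has changed at least once within t seconds with probability 1 − e^(−λt). The simple ratio estimator for λ and the greedy top-k selection are assumptions for illustration, as are all function and field names.

```python
import heapq
import math


def estimate_rate(changes_detected, observed_seconds):
    """Naive change-rate estimate (changes per second); an assumed estimator."""
    return changes_detected / observed_seconds


def stale_probability(rate, seconds_since_crawl):
    """P(at least one change since the last crawl) under a Poisson process."""
    return 1.0 - math.exp(-rate * seconds_since_crawl)


def pick_objects(objects, now, budget):
    """Spend a crawl budget on the objects most likely to be stale."""
    scored = [
        (stale_probability(o["rate"], now - o["last_crawl"]), o["id"])
        for o in objects
    ]
    return [obj_id for _, obj_id in heapq.nlargest(budget, scored)]


objects = [
    {"id": "a", "rate": 1 / 60, "last_crawl": 0},    # changes about once a minute
    {"id": "b", "rate": 1 / 3600, "last_crawl": 0},  # changes about once an hour
    {"id": "c", "rate": 1 / 600, "last_crawl": 300}, # crawled more recently
]
print(pick_objects(objects, now=600, budget=2))
```

Scoring each object separately is what makes the schedule object-level: a slowly changing object is skipped even if it has not been visited for a long time, so the budget goes to objects whose stored copy is probably out of date.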
     Third, it proposes a more efficient high-dimensional indexing technique to address the performance problem of traditional high-dimensional indexing methods. This technique, named CB-LSH, combines a Compressed Bitmap (CB) index with a Locality-Sensitive Hashing (LSH) index. CB-LSH expresses each operator of the LSH index in Boolean form and introduces compressed bitmaps into LSH. It greatly improves performance and solves the online update problem for high-dimensional indexing. Theoretical analysis proves the improvements. Extensive experiments show that CB-LSH achieves 1/3 less memory usage, 10 times the index deletion performance, 4 times the query performance, and 1.5 times the insert performance. Applications in real commercial projects show that CB-LSH is feasible for online updates in a large image retrieval system.
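One way to picture LSH operations recast as Boolean operations on bitmaps is the sketch below. It is an illustration under stated assumptions, not the CB-LSH design: it uses random-hyperplane hashing, stores each bucket as an uncompressed bitmap held in a Python integer (compression is omitted entirely), and handles deletion by clearing one bit in a separate `live` bitmap. The class and method names are hypothetical.

```python
import random


class BitmapLSH:
    """Toy LSH index whose buckets are bitmaps of object ids."""

    def __init__(self, dim, tables=4, bits=8, seed=0):
        rng = random.Random(seed)
        # One set of random hyperplanes per hash table.
        self.planes = [
            [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]
            for _ in range(tables)
        ]
        self.buckets = [dict() for _ in range(tables)]  # hash key -> bitmap
        self.live = 0  # bitmap of ids that have not been deleted

    def _key(self, table, vec):
        key = 0
        for plane in self.planes[table]:
            dot = sum(p * x for p, x in zip(plane, vec))
            key = (key << 1) | int(dot >= 0)
        return key

    def insert(self, obj_id, vec):
        for t, bucket in enumerate(self.buckets):
            k = self._key(t, vec)
            bucket[k] = bucket.get(k, 0) | (1 << obj_id)
        self.live |= 1 << obj_id

    def delete(self, obj_id):
        # Deletion clears a single bit; bucket bitmaps are left untouched.
        self.live &= ~(1 << obj_id)

    def query(self, vec):
        candidates = 0
        for t, bucket in enumerate(self.buckets):
            candidates |= bucket.get(self._key(t, vec), 0)  # OR across tables
        candidates &= self.live  # AND masks out deleted ids
        ids, i = [], 0
        while candidates:
            if candidates & 1:
                ids.append(i)
            candidates >>= 1
            i += 1
        return ids


index = BitmapLSH(dim=4)
index.insert(0, [1.0, 0.0, 0.0, 0.0])
index.insert(1, [-1.0, 0.0, 0.0, 0.0])
index.delete(0)
print(index.query([1.0, 0.0, 0.0, 0.0]))
```

In this toy version a delete touches one bit instead of searching every hash table for the object, which hints at why a bitmap representation could help online updates. A real system would also need compressed bitmaps and cleanup of stale bucket bits, both left out here.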
     Fourth, it proposes a text clustering technique inspired by trigger pairs in natural language, to improve on the results of traditional text clustering algorithms for unstructured text data. Unstructured text data in e-commerce is very short, noisy, and full of professional vocabulary, which makes traditional text clustering algorithms ineffective. The trigger-pair based clustering technique (TrigSigs) uncovers hidden relations between words, adapts to professional vocabulary, and extracts key word features to enable fine-grained, object-level clustering. Simulation experiments show that this technique filters out most noise, distributes weights among word features efficiently, and greatly improves the clustering results.
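A trigger pair is a pair of words in which seeing one raises the likelihood of seeing the other. The sketch below scores word pairs by pointwise mutual information over co-occurrence within a short text. This scoring choice is an assumption for illustration; the abstract does not state how TrigSigs scores or selects pairs. The product titles are invented sample data.

```python
import math
from collections import Counter
from itertools import combinations


def trigger_pair_scores(texts):
    """Return PMI for every word pair that co-occurs in at least one text."""
    n = len(texts)
    word_count = Counter()
    pair_count = Counter()
    for text in texts:
        words = sorted(set(text.lower().split()))
        word_count.update(words)
        pair_count.update(combinations(words, 2))  # pairs are sorted tuples
    return {
        (a, b): math.log(count * n / (word_count[a] * word_count[b]))
        for (a, b), count in pair_count.items()
    }


titles = [
    "nokia n95 phone black",
    "nokia n95 8gb",
    "canon ixus camera black",
    "canon ixus 80",
]
scores = trigger_pair_scores(titles)
print(scores[("n95", "nokia")])    # positive: the words predict each other
print(scores[("black", "nokia")])  # zero: "black" carries no information here
```

A word such as "black" that pairs equally with everything scores near zero and can be treated as noise, while strongly associated pairs mark domain vocabulary worth keeping as clustering features.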