A Study of the Influence of Training-Set Quantitative Indexes in Text Classification

English title: Study about effect of relevant quantitative indexes of training set in text classification
Authors: Li Xiangdong; Cao Huan; Huang Li
Authors (English): LI Xiang-dong; CAO Huan; HUANG Li; School of Information Management, Wuhan University; Center for the Studies of Information Resources (CSIR), Wuhan University; Library, Wuhan University
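Keywords: training set optimization; text classification; multi-factor analysis of variance; corpus; relevant quantitative indexes
Keywords (English): training set optimization; text classification; multiple ANOVA; corpus; relevant quantitative indexes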
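Journal title (Chinese): JSYJ
Journal title (English): Application Research of Computers
Institutions: School of Information Management, Wuhan University; Center for the Studies of Information Resources, Wuhan University; Library, Wuhan University
Publication date: 2014-04-18 09:26
Publisher: Application Research of Computers (计算机应用研究)
Year: 2014
Volume and cumulative issue: v.31; No.277
Language: Chinese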
File name: JSYJ201411029
Page count: 5
Issue: 11
CN: 51-1196/TP
Pages: 130-133+138

Abstract

To examine how the training set affects classification performance, this paper studies three quantitative indexes of the training set: the number of texts in each category, the number of categories, and the number of feature terms. Multi-factor analysis of variance (multiple ANOVA) is applied across several corpora to quantify how these three indexes influence classification performance. The results show that the effect of the number of feature terms on classification performance differs with the number of texts and the number of categories; that is, the three indexes affect classification performance interactively. By optimizing these three indexes, the paper proposes a way to improve classification performance that does not depend on changing the classification algorithm or the feature selection method. Experiments on real-world data indicate that the approach effectively improves classification performance.
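The kind of analysis the abstract describes can be illustrated with a minimal sketch of a three-factor ANOVA decomposition. This is not the authors' code: the function names `factorial_anova` and `synthetic_accuracy`, the factor levels, and all accuracy values are hypothetical and chosen only so that the number of features interacts with the number of texts. The sketch assumes a balanced full-factorial design (the same number of replicate runs in every cell) and reports sums of squares, degrees of freedom, and F ratios without p-values.

```python
from itertools import combinations, product
from math import prod
from statistics import mean


def factorial_anova(cells, factor_names):
    """Sum-of-squares decomposition for a balanced full-factorial design.

    cells maps a tuple of factor levels to a list of replicate scores;
    every cell must hold the same number of replicates.
    Returns {effect name: (sum of squares, degrees of freedom, F ratio)}.
    """
    k = len(factor_names)
    obs = [(lv, y) for lv, ys in cells.items() for y in ys]
    n_levels = [len({lv[i] for lv in cells}) for i in range(k)]

    def marginal_mean(subset, lv):
        # Mean of all observations that share lv's levels on `subset`;
        # the empty subset yields the grand mean.
        return mean(y for other, y in obs
                    if all(other[i] == lv[i] for i in subset))

    def effect(subset, lv):
        # Inclusion-exclusion over sub-subsets isolates the pure effect.
        return sum((-1) ** (len(subset) - r) * marginal_mean(t, lv)
                   for r in range(len(subset) + 1)
                   for t in combinations(subset, r))

    ss_res = sum((y - mean(cells[lv])) ** 2 for lv, y in obs)
    df_res = len(obs) - len(cells)
    table = {}
    for r in range(1, k + 1):
        for subset in combinations(range(k), r):
            ss = sum(effect(subset, lv) ** 2 for lv, _ in obs)
            df = prod(n_levels[i] - 1 for i in subset)
            name = ":".join(factor_names[i] for i in subset)
            table[name] = (ss, df, (ss / df) / (ss_res / df_res))
    table["residual"] = (ss_res, df_res, None)
    return table


def synthetic_accuracy(texts, categories, features, rep):
    # Hypothetical numbers for illustration only, NOT results from the paper.
    acc = 0.70
    acc += 0.05 if texts == 200 else 0.0
    acc -= 0.04 if categories == 10 else 0.0
    if features == 2000:
        # More features help with many texts and hurt with few texts.
        acc += 0.06 if texts == 200 else -0.02
    return acc + (0.01 if rep else -0.01)


cells = {
    (t, c, f): [synthetic_accuracy(t, c, f, rep) for rep in (0, 1)]
    for t, c, f in product((50, 200), (5, 10), (500, 2000))
}
anova = factorial_anova(cells, ["texts", "categories", "features"])
for name, (ss, df, f_ratio) in anova.items():
    print(f"{name:28s} SS={ss:.4f} df={df} F={f_ratio}")
```

In this synthetic setup the `texts:features` term carries a large sum of squares while `categories:features` is essentially zero, which is the pattern an interaction between two indexes would produce. Real experiments would replace `synthetic_accuracy` with measured classifier scores for each training-set configuration.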

References

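[1]LIN Chen,LI Bi-cheng,ZHOU Jie.Text classification method for overlapping classes based on information granularity[J].Journal of the China Society for Scientific and Technical Information,2011,30(4):339-346.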
[2]JAPKOWICZ N,STEPHEN S.The class imbalance problem:a systematic study[J].Intelligent Data Analysis,2002,6(5):429-449.
[3]LI Rong-lu,HU Yun-fa.Noise reduction to text categorization based on density for KNN[C]//Proc of the 2nd International Conference on Machine Learning and Cybernetics.2003:3119-3124.
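[4]LIU Hai-feng,YAO Ze-qing,SU Zhan,et al.K-means-based sample pruning for class-skewed KNN in text classification[J].Microelectronics & Computer,2012,29(5):24-28.
[5]ZHANG Ruo-feng.Research and implementation of instance-based automatic text classification[D].Changchun:Jilin University,2005.
[6]XU Feng-ya,LUO Zhen-sheng.Improved feature weighting algorithm for automatic text classification[J].Computer Engineering and Applications,2005,41(1):181-184,220.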
[7]ZHANG Tong,OLES F J.Text categorization based on regularized linear classification methods[J].Information Retrieval,2001,4(1):5-31.
[8]BEKKERMAN R,EL-YANIV R,TISHBY N,et al.Distributional word clusters vs.words for text categorization[J].Journal of Machine Learning Research,2003,3(3):1183-1208.
[9]HU Xiao,WANG Li,PAN Shou-hui.Web text classification method based on an improved VSM[J].Journal of Intelligence,2012,29(5):144-147.
[10]MARKOVITCH S,ROSENSTEIN D.Feature generation using general constructor functions[J].Machine Learning,2002,49(1):59-98.
[11]ZHANG Jian,YANG Yi-ming.Robustness of regularized linear classification methods in text categorization[C]//Proc of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.New York:ACM Press,2003:190-197.
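[12]CHEN Yu-qin.Automatic classification system for multi-category scientific and technical literature[D].Wuhan:Huazhong University of Science and Technology,2008.
[13]FAN Xing-hua,SUN Mao-song.A high-performance two-class Chinese text classification method[J].Chinese Journal of Computers,2006,29(1):124-131.
[14]JIA Ning.Automatic text classification using concept primitive features[J].Computer Engineering and Applications,2007,43(1):24-26.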
[15]ZHENG Zhao-hui,WU Xiao-yun,SRIHARI R.Feature selection for text categorization on imbalanced data[J].ACM SIGKDD Explorations Newsletter,2004,6(1):80-89.
[16]GUPTA R,RATINOV L.Text categorization with knowledge transfer from heterogeneous data sources[C]//Proc of the 23rd AAAI Conference on Artificial Intelligence.[S.l.]:AAAI Press,2008:842-847.
[17]LEWIS D D.Reuters-21578 text categorization test collection[EB/OL].[2013-08-22].http://www.daviddlewis.com/resources/testcollections/reuters21578.
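[18]Sogou Labs.Text classification corpus[EB/OL].[2013-08-22].http://www.sogou.com/labs/dl/t.html.
[19]HE Lin,LIU Jing,HOU Han-qing.Analysis of factors influencing multi-level automatic classification based on the Chinese Library Classification[J].Journal of Library Science in China,2009,35(184):49-55.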