Abstract
With the rapid development of information technology and the growing number of Internet users, the volume of Internet data is increasing daily, and much of it is unstructured text; text classification has therefore become a hot research topic. The quality of feature selection directly affects the accuracy of text classification. Traditional single feature selection methods each emphasize different aspects, so the feature subsets produced by different methods may differ greatly, which in turn leads to unstable classification results. This paper proposes a feature selection method that combines CHI (chi-square) and IG (information gain), introducing SOM (Score of Mixed) as a fused feature score: features are ranked by their SOM values and filtered against a predetermined threshold to obtain a relatively stable and representative feature subset. Experimental results show that text classification with features selected by this method achieves a measurable improvement over classification with other feature selection methods.
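The selection procedure described above (score each term with CHI and IG, fuse the two scores into a SOM value, rank, and threshold) can be sketched as follows. The abstract does not give the paper's exact SOM formula, so the min-max-normalized sum used here is an assumption, as are all function names; this is a minimal illustration, not the authors' implementation.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(presence, labels):
    """IG of a binary term-presence feature with respect to the class labels."""
    n = len(labels)
    gain = entropy(labels)
    for flag in (True, False):
        subset = [y for p, y in zip(presence, labels) if p == flag]
        if subset:  # subtract the weighted conditional entropy
            gain -= len(subset) / n * entropy(subset)
    return gain

def chi2(presence, labels, cls):
    """Chi-square statistic between term presence and membership in class cls."""
    A = sum(1 for p, y in zip(presence, labels) if p and y == cls)       # term, in class
    B = sum(1 for p, y in zip(presence, labels) if p and y != cls)       # term, other class
    C = sum(1 for p, y in zip(presence, labels) if not p and y == cls)   # no term, in class
    D = sum(1 for p, y in zip(presence, labels) if not p and y != cls)   # no term, other class
    n = A + B + C + D
    denom = (A + C) * (B + D) * (A + B) * (C + D)
    return n * (A * D - B * C) ** 2 / denom if denom else 0.0

def som_select(term_presence, labels, cls, threshold):
    """Rank terms by SOM = normalized CHI + normalized IG; keep those >= threshold.

    The normalized-sum fusion is an assumption standing in for the paper's SOM.
    """
    chi = {t: chi2(p, labels, cls) for t, p in term_presence.items()}
    ig = {t: info_gain(p, labels) for t, p in term_presence.items()}

    def norm(scores):  # min-max normalization so CHI and IG are comparable
        lo, hi = min(scores.values()), max(scores.values())
        return {t: (v - lo) / (hi - lo) if hi > lo else 0.0
                for t, v in scores.items()}

    chi_n, ig_n = norm(chi), norm(ig)
    som = {t: chi_n[t] + ig_n[t] for t in term_presence}
    return sorted((t for t, s in som.items() if s >= threshold),
                  key=lambda t: -som[t])
```

For example, on a toy two-class corpus where the term "good" appears only in positive documents and "the" appears everywhere, `som_select({"good": [True, True, False, False], "the": [True, False, True, False]}, ["pos", "pos", "neg", "neg"], "pos", 1.0)` keeps only the class-discriminative term "good".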