Abstract
Word segmentation is a foundational task in natural language processing, and its quality strongly affects all downstream processing. Recognizing abbreviated case-auxiliary words (紧缩格) is one of the most difficult and most important problems in Tibetan word segmentation. After reviewing existing methods for recognizing Tibetan abbreviated words and analyzing the characteristics of Tibetan syllables and words, this paper proposes a rule-based recognition algorithm, an add-restore algorithm, and a feature template for a maximum entropy model, and combines the three into a single method for recognizing Tibetan abbreviated case-auxiliary words. Experiments show that the method reaches a precision of 99.26%, a recall of 96.47%, and an F1 score of 97.85%, a clear improvement over the highest precision previously reported.
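The three reported figures can be checked against each other. The sketch below assumes that the F1 value is the standard harmonic mean of precision and recall and that 准确率 denotes precision; the abstract itself does not state either. The function and variable names are illustrative only and do not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, in percent.
reported_precision = 99.26
reported_recall = 96.47
reported_f1 = 97.85

# Assumption: F1 = 2PR / (P + R), the usual definition.
computed_f1 = f1_score(reported_precision, reported_recall)
print(round(computed_f1, 2))  # 97.85
```

Under these assumptions the recomputed value agrees with the reported F1 to two decimal places, so the three numbers are mutually consistent.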