Recognition Method of Tibetan Abbreviated Case-Auxiliary Words (藏文紧缩格识别方法)
  • English title: Recognition method of Tibetan abbreviated case-auxiliary words
  • Authors: Lhamatashi (拉玛扎西); Cai Zhijie (才智杰); Zha Xiji (扎西吉)
  • Affiliation: School of Computer, Qinghai Normal University (青海师范大学计算机学院)
  • Keywords: Tibetan; natural language processing (NLP); word segmentation; abbreviated case-auxiliary words
  • Journal code: JSYJ
  • Journal: Application Research of Computers (计算机应用研究)
  • Publication date: 2018-03-14 17:30
  • Publisher: Application Research of Computers (计算机应用研究)
  • Year: 2019
  • Issue: v.36; No.330
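  • Funding: National Natural Science Foundation of China (61866032, 61163018, 61262051); National Social Science Foundation of China (13BYY141, 16BYY167, 15BYY167); Ministry of Education "Chunhui Plan" cooperative research projects (Z2012093, Z2016077); Qinghai Province basic research projects (2017-ZJ-767, 2019-SF-129, 2015-SF-520); "Program for Changjiang Scholars and Innovative Research Team in University" innovative team grant (IRT1068); Qinghai Province key laboratory projects (2013-Z-Y17, 2014-Z-Y32, 2015-Z-Y03); Key Laboratory of Tibetan Information Processing and Machine Translation (2013-Y-17)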
  • Language: Chinese
  • Page: JSYJ201904028
  • Page count: 4
  • CN: 04
  • ISSN: 51-1196/TP
  • Classification number: 126-129
Abstract
Word segmentation is a foundational task in natural language processing and strongly affects the processing steps that follow it. Recognizing abbreviated case-auxiliary words is one of the hardest and most important problems in Tibetan word segmentation. After reviewing existing methods for recognizing Tibetan abbreviated words and analyzing the characteristics of Tibetan characters and words, this paper proposes three components targeted at Tibetan abbreviated case-auxiliary words: a rule-based recognition algorithm, an add-restore algorithm, and a feature template for a maximum entropy model. Combining the three yields a recognition method based on rules, add-restore, and the maximum entropy model. Experimental results show that the method recognizes Tibetan abbreviated case-auxiliary words with a precision of 99.26%, a recall of 96.47%, and an F1 score of 97.85%, a clear improvement over the highest precision previously reported.
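The three figures reported in the abstract can be checked against one another. The sketch below assumes that the first figure is precision (the usual companion of recall) and that F1 is the standard harmonic mean of precision and recall; the function name `f1_score` is illustrative and does not come from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Return the harmonic mean of precision and recall (both in [0, 1])."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract, converted from percentages to fractions.
reported_precision = 0.9926
reported_recall = 0.9647

f1 = f1_score(reported_precision, reported_recall)
print(f"F1 = {f1 * 100:.2f}%")
```

Under these assumptions the computed value rounds to 97.85%, which agrees with the F1 score stated in the abstract.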