A Deep Learning Model and Self-Training Algorithm for Theoretical Terms Extraction
  • Authors: Zhao Hong; Wang Fang (Department of Information Resources Management, Business School, Nankai University)
  • Keywords: theoretical terms extraction; deep learning; recurrent neural network (RNN); bidirectional long short-term memory–conditional random field (Bi-LSTM-CRF); self-training
  • Journal: 情报学报 (Journal of the China Society for Scientific and Technical Information); abbreviation: QBXB
  • Publication date: 2018-09-24
  • Year: 2018; Volume: 37; Issue: 09
  • Funding: National Social Science Fund of China major projects "Research on the Disciplinary Construction of Information Science and the Future Development Path of Information Work" (17ZDA291) and "Research on Network Society Governance in China" (14ZDA063)
  • Language: Chinese
  • CN: 11-2257/G3
  • Article ID: QBXB201809007
  • Pages: 67-82 (16 pages)
Abstract
The extraction of theoretical terms underpins large-scale literature content analysis and the in-depth study of interdisciplinary knowledge transfer. As a specific type of named entity, theoretical terms span many disciplines, occur across a large body of published literature, exhibit complex features, and lack large-scale mature corpora, which makes their extraction challenging. To improve extraction performance and reduce the cost of manually annotating the training set, this paper builds a deep learning model for theoretical term extraction, studies the feature construction and tagging methods for theoretical terms within the model, and proposes a self-training algorithm that enables weakly supervised learning of the model. Experimental comparisons verify the effectiveness of both the model and the self-training algorithm. The work not only provides a more effective, general-purpose method for theoretical term extraction but also offers a methodological reference for the recognition of other types of named entities.
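The self-training procedure described in the abstract follows a common pattern: train on a labeled seed set, pseudo-label the unlabeled pool, and promote only high-confidence predictions back into the training set. The sketch below illustrates that loop in Python with a toy 1-D nearest-centroid classifier standing in for the Bi-LSTM-CRF tagger; the confidence formula, threshold, and all names here are illustrative assumptions, not the paper's exact algorithm.

```python
# Minimal self-training sketch: a toy nearest-centroid model stands in
# for the paper's Bi-LSTM-CRF; confidence here is an assumed 1/(1+distance).

def train_centroids(labeled):
    """Fit a toy 1-D nearest-centroid model: mean feature value per class."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}

def predict_with_confidence(centroids, x):
    """Predict the nearest class; confidence decays with centroid distance."""
    y, d = min(((y, abs(x - c)) for y, c in centroids.items()),
               key=lambda t: t[1])
    return y, 1.0 / (1.0 + d)

def self_train(labeled, unlabeled, threshold=0.5, max_rounds=10):
    """Iteratively promote confident pseudo-labels into the training set."""
    labeled, pool = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = train_centroids(labeled)
        confident, rest = [], []
        for x in pool:
            y, conf = predict_with_confidence(centroids, x)
            (confident if conf >= threshold else rest).append((x, y))
        if not confident:          # nothing crosses the threshold: stop
            break
        labeled.extend(confident)  # promote high-confidence pseudo-labels
        pool = [x for x, _ in rest]
    return train_centroids(labeled)

labeled = [(0.0, "A"), (10.0, "B")]   # small hand-labeled seed set
unlabeled = [0.5, 1.0, 9.0, 9.5, 5.2]  # cheap unlabeled pool
model = self_train(labeled, unlabeled, threshold=0.4)
# 5.2 never clears the threshold, so it is never used for training
```

In the paper's setting, the classifier would be the Bi-LSTM-CRF tagger and confidence would come from the tagger's output scores, but the promote-if-confident loop is the same.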
References
[1] Wikipedia. Theory[EB/OL]. [2018-01-25]. https://zh.wikipedia.org/wiki/%E7%90%86%E8%AB%96.
    [2] Wang F, Chen F, Zhu N, et al. Research on the sources, applications, and disciplinary specificity of information science theories in China[J]. Journal of the China Society for Scientific and Technical Information, 2016, 35(11): 1148-1164.
    [3] Chen F, Zhai Y J, Wang F. Automatic recognition of theories in academic journals based on conditional random fields[J]. Library and Information Service, 2016, 60(2): 122-128.
    [4] Lu W, Meng R, Liu X B. Research on a citation-content annotation framework oriented to citation relations[J]. Journal of Library Science in China, 2014, 35(6): 93-104.
    [5] Xu S R, Lu C, Zhang C Z. Measuring interdisciplinarity from the perspective of term citation: a case study of six disciplines in PLOS ONE[J]. Journal of the China Society for Scientific and Technical Information, 2017, 36(8): 809-820.
    [6] Huang Z, Xu W, Yu K. Bidirectional LSTM-CRF models for sequence tagging[J]. arXiv preprint arXiv:1508.01991, 2015.
    [7] Rondeau M A, Su Y. LSTM-based NeuroCRFs for named entity recognition[C]// INTERSPEECH 2016: The 17th Annual Conference of the International Speech Communication Association, San Francisco, CA, USA, 2016: 665-669.
    [8] Hua B L. Extracting information method terms from Chinese academic literature[J]. New Technology of Library and Information Service, 2013(6): 68-75.
    [9] Collobert R, Weston J, Karlen M, et al. Natural language processing (almost) from scratch[J]. Journal of Machine Learning Research, 2011, 12: 2493-2537.
    [10] Sundermeyer M, Schlüter R, Ney H. LSTM neural networks for language modeling[C]// INTERSPEECH 2012: The 13th Annual Conference of the International Speech Communication Association, Portland, Oregon, USA, 2012: 601-608.
    [11] Graves A, Mohamed A R, Hinton G. Speech recognition with deep recurrent neural networks[C]// Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing. IEEE, 2013: 6645-6649.
    [12] Chiu J P C, Nichols E. Named entity recognition with bidirectional LSTM-CNNs[J]. Transactions of the Association for Computational Linguistics, 2016, 4: 357-370.
    [13] Lample G, Ballesteros M, Subramanian S, et al. Neural architectures for named entity recognition[C]// Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: Association for Computational Linguistics, 2016: 260-270.
    [14] Ma X Z, Hovy E. End-to-end sequence labeling via bi-directional LSTM-CNNs-CRF[C]// Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2016: 1064-1074.
    [15] Limsopatham N, Collier N. Bidirectional LSTM for named entity recognition in Twitter messages[C]// Proceedings of the 2nd Workshop on Noisy User-generated Text, Osaka, Japan, 2016: 145-152.
    [16] He H, Sun X. A unified model for cross-domain and semi-supervised named entity recognition in Chinese social media[C]// Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2017: 3216-3222.
    [17] Pham T H, Le-Hong P. End-to-end recurrent neural network models for Vietnamese named entity recognition: word-level vs. character-level[C]// Proceedings of the International Conference of the Pacific Association for Computational Linguistics. Singapore: Springer, 2018, 781: 219-232.
    [18] Dong C H, Wu H J, Zhang J J, et al. Multichannel LSTM-CRF for named entity recognition in Chinese social media[C]// Proceedings of the Sixteenth China National Conference on Computational Linguistics. Cham: Springer, 2017, 10565: 197-208.
    [19] Yi H K, Huang J M, Yang S Q. A Chinese named entity recognition system with neural networks[C]// Proceedings of the 4th International Conference on Information Technology and Applications. EDP Sciences, 2017: Article No. 04002.
    [20] Peters M E, Ammar W, Bhagavatula C, et al. Semi-supervised sequence tagging with bidirectional language models[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2017: 1756-1765.
    [21] Ni J, Dinu G, Florian R. Weakly supervised cross-lingual named entity recognition via effective annotation and representation projection[C]// Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2017: 1470-1480.
    [22] Rei M, Crichton G K O, Pyysalo S. Attending to characters in neural sequence labeling models[C]// Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers, Osaka, Japan, 2016: 309-318.
    [23] Sun Y Q, Li L, Xie Z W, et al. Co-training an improved recurrent neural network with probability statistic models for named entity recognition[C]// Proceedings of the 22nd International Conference on Database Systems for Advanced Applications. Cham: Springer, 2017: 545-555.
    [24] Shen Y Y, Yun H, Lipton Z C, et al. Deep active learning for named entity recognition[C]// Proceedings of the 2nd Workshop on Representation Learning for NLP. Stroudsburg: Association for Computational Linguistics, 2017: 252-256.
    [25] Yang Z, Salakhutdinov R, Cohen W W. Transfer learning for sequence tagging with hierarchical recurrent networks[C]// Proceedings of the 5th International Conference on Learning Representations, Toulon, France, 2017.
    [26] Mikolov T, Karafiát M, Burget L, et al. Recurrent neural network based language model[C]// Proceedings of the 11th Annual Conference of the International Speech Communication Association, Makuhari, Chiba, Japan, 2010: 1045-1048.
    [27] Sundermeyer M, Schlüter R, Ney H. LSTM neural networks for language modeling[C]// Proceedings of the 13th Annual Conference of the International Speech Communication Association, Portland, USA, 2012: 601-608.
    [28] Lafferty J D, McCallum A, Pereira F C N. Conditional random fields: probabilistic models for segmenting and labeling sequence data[C]// Proceedings of the Eighteenth International Conference on Machine Learning. San Francisco: Morgan Kaufmann Publishers, 2001: 282-289.
    [29] Wang F. Paradigm shifts and metatheoretical research in information science[J]. Journal of the China Society for Scientific and Technical Information, 2007, 26(5): 764-773.
    [30] Wang F, Shi H Y, Ji X M. The application of theories in Chinese information science research: a content analysis based on the Journal of the China Society for Scientific and Technical Information[J]. Journal of the China Society for Scientific and Technical Information, 2015, 34(6): 581-591.
    [31] Sun X L. Mathematics, Science, Philosophy[M]. Beijing: Guangming Daily Press, 1988: 195-209.
    [32] Hinton G E. Learning distributed representations of concepts[C]// Proceedings of the 8th Annual Conference of the Cognitive Science Society, Amherst, USA, 1986: 1-12.
    [33] Mikolov T, Corrado G, Chen K, et al. Efficient estimation of word representations in vector space[C]// Proceedings of the International Conference on Learning Representations, Scottsdale, Arizona, USA, 2013: 1-12.
    [34] Zhang J, Qu D, Li Z. A recurrent neural network language model based on word vector features[J]. Pattern Recognition and Artificial Intelligence, 2015, 28(4): 299-305.
    [35] Hinton G E, Srivastava N, Krizhevsky A, et al. Improving neural networks by preventing co-adaptation of feature detectors[J]. Computer Science, 2012, 3(4): 212-223.
    [36] Hagenauer J, Hoeher P. A Viterbi algorithm with soft-decision outputs and its applications[C]// Proceedings of the IEEE Global Telecommunications Conference and Exhibition "Communications Technology for the 1990s and Beyond", 1989: 47.
    [37] Pei W D, Luo W X, Li W D. The SOVA decoding algorithm and its performance[J]. Radio Engineering, 2003, 33(11): 11-13.
    [38] Jiang X B, Chen J, Qiu Y L. A simplified SOVA algorithm[J]. Chinese Journal of Electron Devices, 2004, 27(3): 467-469.
    [39] Yang J Z, Gu X Z, Du X N, et al. Modification of the Viterbi algorithm by the SOVA algorithm[J]. Communications Technology, 2007(4): 4-6.
    [40] Teng S H. CRF-based Chinese word segmentation and short-text classification[D]. Beijing: Tsinghua University, 2009.
    [41] Gao X L, Zhang P Y, Zhang Z, et al. Research on word-level confidence based on conditional random fields[C]// Proceedings of the 4th Youth Academic Conference of the Institute of Acoustics, Chinese Academy of Sciences, 2012: 290-293.
    [42] Yan Z F, Ji D H. Chinese temporal information extraction based on CRF and semi-supervised learning[J]. Computer Engineering and Design, 2015, 36(6): 1642-1646.
    [43] Chen J M, Liu J, Huang Y L, et al. Recognition of abbreviation expansions based on semi-supervised CRF[J]. Computer Engineering, 2013, 39(4): 203-209.
    [44] Murthy V R, Bhattacharyya P. A deep learning solution to named entity recognition[C]// Proceedings of the International Conference on Intelligent Text Processing and Computational Linguistics. Cham: Springer, 2016: 427-438.