Research on a geological entity relation extraction model for gold mine based on BERT
Abstract: Intelligent identification of entity relations in gold mine literature is an important means of improving literature mining, analysis, and knowledge extraction. This study addresses two core problems in current gold mine entity relation extraction, namely the complexity of the entity relations and the scarcity of manually annotated data, and proposes a relation extraction model based on BERT (Bidirectional Encoder Representations from Transformers) and trained with distant (remote) supervision. Optimizing the modules for gold mine geological data encoding, gold mine classification, and gold mine geological entity filtering improves the accuracy of geological entity relation extraction. Entity relation extraction experiments on 290489 items of gold mine literature data verify the effectiveness of the method.
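
Distant supervision, on which the proposed model relies, replaces manual annotation with automatic alignment between a knowledge base and raw text. The sketch below illustrates only the general labeling assumption (a sentence that mentions both entities of a known triple is taken to express that triple's relation); it is not the authors' pipeline. The function name, the `located_in` relation label, and the example sentences are hypothetical and chosen for illustration.

```python
from typing import List, NamedTuple, Sequence, Tuple


class Instance(NamedTuple):
    """One automatically labeled training example."""
    sentence: str
    head: str
    tail: str
    relation: str


def distant_label(
    sentences: Sequence[str],
    triples: Sequence[Tuple[str, str, str]],
) -> List[Instance]:
    """Label every sentence that contains both entities of a known triple."""
    instances = []
    for sentence in sentences:
        for head, relation, tail in triples:
            # Plain substring matching stands in for real entity recognition.
            if head in sentence and tail in sentence:
                instances.append(Instance(sentence, head, tail, relation))
    return instances


# Toy knowledge base and corpus (illustrative data, not from the paper).
kb = [("Jiaojia gold deposit", "located_in", "Jiaodong Peninsula")]
corpus = [
    "The Jiaojia gold deposit lies in the Jiaodong Peninsula.",
    "Work on the Jiaojia gold deposit cites surveys of the Jiaodong Peninsula.",
    "Gold occurs in quartz veins.",
]

labeled = distant_label(corpus, kb)
for item in labeled:
    print(item.relation, "<-", item.sentence)
```

The second sentence receives the `located_in` label even though it does not state the relation. Noisy labels of this kind are the usual weakness of distant supervision and are the sort of problem an entity filtering step is meant to reduce.
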
Key words: remote supervision / relation extraction / BERT / geological entity

Table 1. Experiment parameters

| Parameter | Symbol | Value |
|---|---|---|
| Batch size | B | 8 |
| Learning rate (Adam) | λ | 2e-5 |
| Number of epochs | E | 6 |
| Dropout rate | P | 0.1 |
| Max sentence length | ML | 384 |

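
The settings in Table 1 can be gathered into a single configuration object so that a training script reads them from one place. The following is a minimal sketch; the class name, field names, and the helper `steps_per_epoch` are assumptions made for illustration, and the example dataset size of 1000 is hypothetical rather than a figure from the paper.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ExperimentConfig:
    """Hyperparameters listed in Table 1 (field names are illustrative)."""
    batch_size: int = 8              # B
    learning_rate: float = 2e-5      # lambda, used with the Adam optimizer
    num_epochs: int = 6              # E
    dropout_rate: float = 0.1        # P
    max_sentence_length: int = 384   # ML


def steps_per_epoch(num_examples: int, batch_size: int) -> int:
    """Optimizer steps needed to visit every example once (ceiling division)."""
    return -(-num_examples // batch_size)


config = ExperimentConfig()
print(asdict(config))

# Hypothetical dataset of 1000 training examples.
total_steps = steps_per_epoch(1000, config.batch_size) * config.num_epochs
print(total_steps)  # 125 steps per epoch * 6 epochs = 750
```
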
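Table 2. Extraction performance of the models on the NYT dataset

AUC: area under the receiver operating characteristic curve. P@N: precision of the top N predictions (%). Avg Prec: average precision (%).

| Model | AUC | P@100 | P@200 | P@300 | P@500 | P@1000 | P@2000 | P@5000 | Avg Prec (Top 300) | Avg Prec (Top 1000) | Avg Prec (Top 5000) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DenseNet | 0.34 | 81.0 | 69.5 | 68.7 | 61.4 | 51.6 | 39.5 | 22.4 | 73.1 | 66.4 | 56.3 |
| ResNet | 0.10 | 54.0 | 50.0 | 48.0 | 43.0 | 31.0 | 19.0 | 9.9 | 50.7 | 45.2 | 36.4 |
| PCNN+ATT | 0.32 | 74.0 | 67.5 | 64.3 | 59.8 | 48.7 | 37.2 | 22.3 | 68.6 | 62.9 | 53.4 |
| Proposed model | 0.65 | 98.0 | 96.0 | 94.3 | 92.6 | 91.1 | 80.9 | 67.0 | 96.1 | 94.4 | 88.6 |
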
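Table 3. Extraction performance of the methods on the geological dataset

AUC: area under the receiver operating characteristic curve. P@N: precision of the top N predictions (%). Avg Prec: average precision (%).

| Model | AUC | P@100 | P@200 | P@300 | P@500 | P@1000 | P@2000 | P@5000 | Avg Prec (Top 300) | Avg Prec (Top 1000) | Avg Prec (Top 5000) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| DenseNet | 0.40 | 88.0 | 54.5 | 39.7 | 33.8 | 23.5 | 15.8 | 8.0 | 60.7 | 47.9 | 37.6 |
| ResNet | 0.34 | 70.0 | 52.5 | 40.3 | 30.6 | 23.6 | 15.3 | 7.8 | 54.3 | 43.4 | 34.3 |
| PCNN+ATT | 0.60 | 99.0 | 81.0 | 63.3 | 50.0 | 30.5 | 18.3 | 8.1 | 81.1 | 64.8 | 50.0 |
| Proposed model | 0.75 | 100.0 | 100.0 | 99.3 | 98.4 | 98.6 | 96.1 | 93.1 | 99.8 | 99.3 | 97.9 |
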
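
The P@N and Avg Prec columns of Tables 2 and 3 can be illustrated with a short sketch. P@N is the percentage of correct predictions among the N highest-scored ones. The Avg Prec columns are not defined in this excerpt; the reported values are consistent with the mean of the P@N entries up to the given cutoff (for the DenseNet row of Table 2, (81.0 + 69.5 + 68.7) / 3 ≈ 73.1), and the code assumes that reading. The function names and the toy ranking are illustrative.

```python
from typing import Sequence


def precision_at_n(ranked_correct: Sequence[bool], n: int) -> float:
    """Percentage of correct predictions among the n highest-scored ones.

    ranked_correct holds one flag per prediction, sorted by descending score.
    """
    if n <= 0 or n > len(ranked_correct):
        raise ValueError("n must be between 1 and the number of predictions")
    return 100.0 * sum(ranked_correct[:n]) / n


def mean_precision(p_at_n: Sequence[float]) -> float:
    """Mean of several P@N values (assumed reading of the Avg Prec columns)."""
    return sum(p_at_n) / len(p_at_n)


# Toy ranking of four predictions, best score first.
toy_ranking = [True, True, False, True]
print(precision_at_n(toy_ranking, 2))  # 100.0
print(precision_at_n(toy_ranking, 4))  # 75.0

# DenseNet row of Table 2: P@100, P@200, P@300.
densenet_top300 = mean_precision([81.0, 69.5, 68.7])
print(round(densenet_top300, 1))  # 73.1
```
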