Methods of extracting biomedical information from patents and scientific publications (on the example of chemical compounds)
- Authors: Kolpakov N.A.1, Molodchenkov A.I.2,3, Lukin A.V.2,3
- Affiliations:
- Moscow Institute of Physics and Technology (MIPT)
- Federal research center “Computer science and control” of RAS
- Peoples’ Friendship University of Russia (RUDN University)
- Issue: Vol 31, No 1 (2023)
- Pages: 64-74
- Section: Articles
- URL: https://journals.rudn.ru/miph/article/view/34463
- DOI: https://doi.org/10.22363/2658-4670-2023-31-1-64-74
- EDN: https://elibrary.ru/VNWSXI
Abstract
This article proposes an algorithm for solving the problem of extracting information from biomedical patents and scientific publications. The introduced algorithm is based on machine learning methods. Experiments were carried out on patents from the USPTO database. Experiments have shown that the best extraction quality was achieved by a model based on BioBERT.
Full Text
1. Introduction

Every year the number of biomedical patents and scientific publications increases significantly. These texts often lack descriptive metadata, which leads to a large amount of unstructured data. Consequently, there is a growing need for tools that can accurately extract the required information from such texts.

To extract information from texts for further processing, both machine learning approaches and algorithms based on regular expressions can be used. In [1, 2], regular expressions play a key role; in contrast, [3, 4] rely on machine learning methods, in particular conditional random fields. In [5], a transformer-based machine learning technique is used, which, with proper parameter settings, can extract biomedical data quite well. Although tools have been created for analyzing and interacting with unstructured data, these solutions are often based on rules that apply only to the specific data being processed.

In this paper, we propose a solution to the problem of extracting biomedical information from patents using regular expressions. The resulting structured information can then be used to train complex neural network models that will allow us to correctly extract information from a larger number of texts.

2. Related work

Few existing solutions address this problem directly. Existing algorithms are often designed to solve a wide range of problems, so they do not give sufficiently high results on the task of extracting definitions from biomedical patents and scientific publications.

For example, Jinhyuk Lee and his colleagues presented BioBERT (Bidirectional Encoder Representations from Transformers for Biomedical Text Mining) [5], a transformer language model [6] developed for automatic processing of biomedical language and pre-trained on large biomedical corpora.
This model can extract biomedical named entities and biomedical relationships from text, and can also answer biomedical questions. BioBERT is initialized with the weights obtained for BERT [7] (a model previously trained on texts from English Wikipedia and BooksCorpus), after which it was further trained on biomedical texts (PubMed abstracts and full-text PMC articles).

The article [3] presents a different approach to solving NLP problems in the field of biomedicine. CLAMP (Clinical Language Annotation, Modeling, and Processing) uses both machine learning-based and rule-based methods to extract information. This toolkit can extract named entities, split text into tokens, and much more. The authors offer a choice of three tokenizers: 1) the OpenNLP tokenizer [8], based on machine learning; 2) a tokenizer that splits words on specified characters; 3) a rule-based tokenizer with various configuration parameters. For named entity extraction, the authors suggest using: 1) the conditional random fields (CRF) algorithm [9]; 2) a dictionary-based algorithm with a large biomedical vocabulary collected from various resources, such as UMLS; 3) a regular expression-based algorithm for entities with common patterns.

OSCAR4 (Open-Source Chemistry Analysis Routines) [2] is an open system for automatic extraction of chemical terms from scientific articles. It identifies chemicals using regular expressions and a dictionary of predefined words. To identify complex chemical compounds (which consist of several tokens), a maximum-entropy Markov model is used.

In [1], the authors use morphology to extract biomedical words. The chemical object recognition system consists of two subsystems.
The first extracts chemical objects and marks them in a normalized input document using a dictionary of predefined words and a morphological approach. The morphology-based approach identifies the various elements of a chemical compound and combines them to form the final compound. The second subsystem extracts additional chemical elements, distributes all recognized objects into classes of compounds, and can also expand abbreviations and correct spelling errors. To determine whether an entity is “chemical”, the authors collected statistical information for each individual object. This information is used at the last stage of named entity extraction to classify the extracted object as chemical or not.

These methods extract information from biomedical texts in general; they are not aimed at extracting Markush structures [10] (see figure 1).

Figure 1. An example of a Markush structure, taken from US Patent 20040171623

3. The problem statement

Data on biomedical patents are publicly available from various patent offices. Patents usually have a clear structure, which includes the patent name, abstract, description, Claims and bibliographic information (date, patent number, authors). The section we are interested in, Claims (see figure 2), contains a description of the chemical compounds claimed by the authors of the patent. These compounds are exactly the subject of the legal protection provided by the patent. The Claims section may itself contain several subsections with information on different chemical chains. The compounds presented in the Claims section can be described using a Markush structure [10] (see figure 1). To find patents whose Markush structures are the same or similar, these structures need to be compared.
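To make the cost of such a comparison concrete, the following minimal sketch models a Markush structure as a core template with named variables, each having a list of admissible substituents. The class name `Markush`, the SMILES-like template and the substituent lists are hypothetical and chosen for illustration only; this is not the representation used in this work.

```python
from dataclasses import dataclass, field
from itertools import product
from math import prod


@dataclass
class Markush:
    """Toy Markush structure: a core template plus allowed values per variable."""
    core: str                                      # template with {R1}-style placeholders
    variables: dict = field(default_factory=dict)  # variable name -> list of substituents

    def size(self) -> int:
        """Number of concrete compounds covered by the structure."""
        return prod(len(values) for values in self.variables.values())

    def expand(self):
        """Yield every concrete substitution of the core template."""
        names = list(self.variables)
        for combo in product(*(self.variables[name] for name in names)):
            yield self.core.format(**dict(zip(names, combo)))


# Hypothetical example, not taken from any patent.
m = Markush(
    core="c1ccc({R1})cc1{R2}",
    variables={"R1": ["F", "Cl", "Br"], "R2": ["C", "CC", "O", "N"]},
)
print(m.size())          # 12
print(next(m.expand()))  # c1ccc(F)cc1C
```

Even this toy structure covers 12 compounds. The count is the product of the sizes of the substituent lists, so it grows quickly as variables are added.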
Since the Markush structure is a network model, comparing such models directly is a very resource-intensive process. Therefore, so-called fingerprints, which reflect the information presented in the Markush structures, are often used instead. Before fingerprints can be built, however, the information contained in such structures must be extracted, which is the aim of this work.

The task consists in extracting from the Claims section chemical compounds (see table 1), names of variables (in place of which various values can be substituted), chemical elements, formulas and InChI codes [11] (see figure 3) in order to transform this textual information into a formal structured representation.

Figure 2. An example of the data in the Claims section, taken from US Patent 20120208859

Table 1. Examples of chemical compounds
Compound Name    Formula
nitrogen monoxide

About the authors
Nikolay A. Kolpakov
Moscow Institute of Physics and Technology (MIPT)
Email: kolpakov.na@phystech.edu
ORCID iD: 0000-0002-1640-1357
Master’s degree student of Phystech School of Applied Mathematics and Informatics
9, Institutskiy Pereulok, Dolgoprudny, Moscow Region, 141700, Russian Federation

Alexey I. Molodchenkov
Federal research center “Computer science and control” of RAS; Peoples’ Friendship University of Russia (RUDN University)
Email: aim@tesyan.ru
ORCID iD: 0000-0003-0039-943X
Candidate of Technical Sciences; employee of the Federal Research Center “Computer Science and Control” of RAS and of the Peoples’ Friendship University of Russia
44-2, Vavilova St., Moscow, 119333, Russian Federation; 6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation

Anton V. Lukin
Federal research center “Computer science and control” of RAS; Peoples’ Friendship University of Russia (RUDN University)
Author for correspondence.
Email: antonvlukin@gmail.com
ORCID iD: 0000-0003-4391-1958
Employee of the Federal Research Center “Computer Science and Control” of RAS and of the Peoples’ Friendship University of Russia
44-2, Vavilova St., Moscow, 119333, Russian Federation; 6, Miklukho-Maklaya St., Moscow, 117198, Russian Federation

References
- S. A. Akhondi et al., “Automatic identification of relevant chemical compounds from patents,” Database: the journal of biological databases and curation, vol. 1, pp. 1-14, 2019. doi: 10.1093/database/baz001.
- D. Jessop, S. Adams, E. Willighagen, L. Hawizy, and P. Murray-Rust, “OSCAR4: A flexible architecture for chemical text-mining,” Journal of Cheminformatics, vol. 3, no. 1, pp. 1-12, 2011. doi: 10.1186/1758-2946-3-41.
- E. Soysal et al., “CLAMP - a toolkit for efficiently building customized clinical natural language processing pipelines,” Journal of the American Medical Informatics Association, vol. 25, no. 3, pp. 331-336, 2017. doi: 10.1093/jamia/ocx132.
- M. Swain and J. Cole, “ChemDataExtractor: a toolkit for automated extraction of chemical information from the scientific literature,” Journal of Chemical Information and Modeling, vol. 56, no. 10, pp. 1894-1904, 2016. doi: 10.17863/CAM.10935.
- J. Lee, W. Yoon, S. Kim, D. Kim, S. Kim, C. So, and J. Kang, “BioBERT: a pre-trained biomedical language representation model for biomedical text mining,” Bioinformatics (Oxford, England), vol. 36, no. 4, pp. 1234-1240, 2019. doi: 10.1093/bioinformatics/btz682.
- A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, pp. 5998-6008, 2017.
- J. Devlin, M.-W. Chang, K. Lee, and K. Toutanova, “BERT: pretraining of deep bidirectional transformers for language understanding,” Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1, pp. 4171-4186, 2018. doi: 10.18653/v1/N19-1423.
- The OpenNLP Project, http://opennlp.apache.org, Accessed: 2023-03-07.
- CRFsuite: a Fast Implementation of Conditional Random Fields (CRFs), http://www.chokkan.org/software/crfsuite/, Accessed: 2023-03-07.
- J. M. Bernard, “Handling of Markush Structures,” Journal of chemical information and computer sciences, vol. 31, no. 1, pp. 64-68, 1991. doi: 10.1021/ci00001a010.
- S. Heller, A. McNaught, I. Pletnev, S. Stein, and D. Tchekhovskoi, “The IUPAC International Chemical Identifier,” Journal of Cheminformatics, vol. 7, pp. 1-34, 2015. doi: 10.1186/s13321-015-0068-4.
- USPTO, https://www.uspto.gov/patents, Accessed: 2023-03-07.
- T. Mikolov, G. Corrado, K. Chen, and J. Dean, “Efficient estimation of word representations in vector space,” Proceedings of Workshop at ICLR, pp. 1-12, 2013.
- T. Mikolov, W.-T. Yih, and G. Zweig, “Linguistic regularities in continuous space word representations,” Proceedings of NAACL-HLT, pp. 746-751, 2013.
- C. Cortes and V. Vapnik, “Support-vector networks,” Machine Learning, vol. 20, no. 3, pp. 273-297, 1995. doi: 10.1007/BF00994018.
- J. R. Finkel, T. Grenager, and C. Manning, “Incorporating non-local information into information extraction systems by Gibbs sampling,” Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 363-370, 2005. doi: 10.3115/1219840.1219885.
- T. M. Mitchell, Machine learning. McGraw-Hill New York, 1997, 432 pp.