NER: A Quick Introduction
Text Mining is a process of knowledge discovery in which relevant information is extracted from unstructured text. Text mining presents itself as an interdisciplinary method for information retrieval and encompasses several Natural Language Processing (NLP) tasks. Among these NLP tasks we have Named Entity Recognition (NER).
NER has the goal of recognizing relevant entities in text, such as persons, locations, or organizations, but it can also be applied to specific domains of work. For example, in the biomedical domain, NER can be used to identify proteins, disease names, biological processes, drugs, and chemical compounds, among others. NER is essential for many other NLP applications, such as text summarization, question answering, and machine translation.
NER solutions can be organized into three categories:
The Rule-Based approach is the simplest of the three. It uses a set of rules to identify the entities in the text. A lexicon containing the entities that we want to identify might also be used. Unicage NER is an example of this approach.
The Artificial Intelligence approaches are based on machine learning and deep learning techniques. They learn to recognize entities from examples and can achieve strong results. However, they usually require a corpus of labeled text to train the models, and they are usually more computationally intensive.
Finally, the Hybrid approaches combine machine learning techniques with rule-based systems. In most cases, these approaches consist of a machine learning model that is later refined with additional linguistic rules to improve its accuracy.
What is Unicage NER?
Unicage NER is a rule-based system that can be used to identify key entities in text as long as those entities are present in a lexicon file. This lexicon is created by the user according to their needs. It can consist of just two or three words, or it can be built from an entire dictionary or ontology.
Unicage NER was developed with the goal of being a fast and reliable tool to identify key biomedical entities in scientific text. However, it can be used in other domains as well. It was not made to replace current NER solutions, but rather to complement them. In addition, Unicage NER differentiates itself by processing large amounts of data quickly, without the need for GPUs or demanding computational resources. This performance is possible because our solution was developed using mostly Unicage commands and shell scripting.
Use Case: Identifying Biomedical Entities
Let us take as a use case the identification of biomedical entities present in scientific text. We have a file composed of 110,000 scientific articles that were extracted from PubMed. This file is in JSON format and is approximately 227 MB in size.
With Unicage NER we shall do the following:
- JSON file parsing and data normalization
- Stop word removal and tokenization
- Identification of biomedical entities
- Entity Linking with additional information
During the parsing stage, the JSON file is parsed using Unicage’s rjson command, which converts the file to a field-separated format. This conversion normalizes the data and makes further processing possible without specialized tools for reading JSON data.
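To give a rough idea of what such a conversion achieves, here is a minimal Python sketch. It is not the rjson command and does not reproduce its output format. The function name and the field names pmid, title, and abstract are assumptions about the input layout, made only for illustration.

```python
import json

def flatten_articles(json_text, fields=("pmid", "title", "abstract")):
    """Turn a JSON array of article objects into tab-separated lines.

    The field names are hypothetical; adjust them to the real input.
    """
    lines = []
    for article in json.loads(json_text):
        values = []
        for name in fields:
            # Missing fields become empty strings. Tabs and newlines inside
            # a value are collapsed so each record stays on a single line.
            value = str(article.get(name, ""))
            values.append(" ".join(value.split()))
        lines.append("\t".join(values))
    return lines

sample = '[{"pmid": "20438581", "title": "A study", "abstract": "Line one.\\nLine two."}]'
for line in flatten_articles(sample):
    print(line)
```

Once every article sits on one line with a fixed number of fields, ordinary line-oriented text tools can handle the remaining steps.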
Then, we start the tokenization process. Tokenization is a crucial step in NLP and consists of breaking sequences of text into smaller fragments called tokens. These tokens can be words, sentences or even whole paragraphs, depending on the splitting criteria chosen. In our case, each token corresponds to a word. After the tokenization process is completed, we remove the stop words that are present among the tokens.
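The two steps just described can be sketched in a few lines of Python. This is an illustration only: the stop word list is a tiny hypothetical one, and the normalization rules (lowercasing, dropping possessive endings) are assumptions rather than the rules Unicage NER applies.

```python
import re

# Hypothetical, deliberately tiny stop word list; a real one is much longer.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "with"}

def tokenize(text):
    """Lowercase the text, drop possessive endings, and split into words."""
    text = re.sub(r"['’]s\b", "", text.lower())
    return re.findall(r"[a-z0-9]+", text)

def remove_stop_words(tokens):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]

tokens = remove_stop_words(tokenize("Dementia and cognition in Parkinson's disease"))
print(tokens)  # ['dementia', 'cognition', 'parkinson', 'disease']
```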
The next step is the NER phase. As we explained before, Unicage NER is a rule-based approach that uses a lexicon file comprising the key entities that we want to find in our text. In this use case, our lexicon is composed of key entities from different biomedical ontologies, namely the Disease Ontology, ChEBI, Gene Ontology and Drugbank. Through a combination of Shell and Unicage commands, we are able to identify the entities from the lexicon that are present in the tokenized text.
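As an illustration of lexicon-based matching in general, the Python sketch below looks up word sequences in a small lexicon and prefers the longest match. It is not the Shell and Unicage pipeline itself. The function name, the max_words limit, and the longest-match rule are assumptions; the underscore-joined entry names follow the style of the output shown later in this article.

```python
def find_entities(tokens, lexicon, max_words=4):
    """Return the sorted, de-duplicated lexicon entries found in a token list.

    Multi-word entries are stored with underscores ("parkinson_disease"),
    so candidate word sequences are joined the same way before lookup.
    """
    found = set()
    i = 0
    while i < len(tokens):
        # Try the longest candidate first, so that "parkinson_disease"
        # wins over the shorter entry "disease".
        for n in range(min(max_words, len(tokens) - i), 0, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in lexicon:
                found.add(candidate)
                i += n
                break
        else:
            i += 1  # no lexicon entry starts at this token
    return sorted(found)

lexicon = {"parkinson_disease", "dementia", "cognition", "disease"}
print(find_entities(["dementia", "cognition", "parkinson", "disease"], lexicon))
# ['cognition', 'dementia', 'parkinson_disease']
```

Storing the lexicon as a set keeps each lookup at constant cost on average, so the running time grows roughly linearly with the number of tokens.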
Finally, we have the Entity Linking step. In this step, we add further information to the entities that we identified earlier. This information is also present in our lexicon. An example is a link to the ontology page where more detailed information about the entity can be found.
In the following table, we present the time that we needed to run Unicage NER on a simple laptop using an Ubuntu virtual machine with 4 cores and 4 GB of RAM. As we can see, Unicage NER takes less than 1 minute to run on 110,000 articles.
As to the output, two files are generated: one file contains the entities that were found in the articles, while the other contains the information from the Entity Linking.
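Conceptually, this linking step is a lookup keyed on the entity name. The Python sketch below shows the idea. The two lexicon rows are copied from the sample output in this article, but the dictionary layout and the function name are assumptions, not the actual format of the lexicon file.

```python
# Rows copied from the sample output; the dictionary layout is an assumption.
LEXICON = {
    "parkinson_disease": ("Parkinson's_disease", "DOID_14330",
                          "http://purl.obolibrary.org/obo/DOID_14330"),
    "dementia": ("dementia", "DOID_1307",
                 "http://purl.obolibrary.org/obo/DOID_1307"),
}

def link_entities(pmid, entities, lexicon=LEXICON):
    """Build one space-separated row per entity that has lexicon details."""
    rows = []
    for entity in entities:
        if entity in lexicon:
            onto_entity, onto_id, onto_link = lexicon[entity]
            rows.append(" ".join([pmid, entity, onto_entity, onto_id, onto_link]))
    return rows

for row in link_entities("21705020", ["dementia", "parkinson_disease"]):
    print(row)
```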
File with Entities
20438581 neuroprotective_agent parkinson_disease
21553114 multiple_system_atrophy parkinson_disease storage
21600591 constipation neuritis parkinson_disease sleep starts toxin
21705020 cognition dementia parkinson_disease pas
21707551 memory parkinson_disease solution
21735480 application complex parkinson_disease
21763451 binding dna mtdna_replication oxidative_phosphorylation oxphos parkinson_disease parkinsonism phosphorylation xenobiotic
21780180 parkinson_disease
21834616 parkinson_disease synucleinopathy
21849176 lewy_body parkinson_disease parkinsonism syndrome
21858430 complex kynurenine metabolism parkinson_disease tryptophan tryptophan_metabolism
21868278 behavior parkinson_disease participant
21887711 parkinson_disease
(...)
File with Entity Linking
pmid found_entity onto_entity onto_ID onto_link
20438581 neuroprotective_agent neuroprotective_agent CHEBI_63726 http://purl.obolibrary.org/obo/CHEBI_63726
20438581 parkinson_disease Parkinson's_disease DOID_14330 http://purl.obolibrary.org/obo/DOID_14330
21553114 multiple_system_atrophy Multiple_system_atrophy DOID_4752 http://purl.obolibrary.org/obo/DOID_4752
21553114 parkinson_disease Parkinson's_disease DOID_14330 http://purl.obolibrary.org/obo/DOID_14330
21553114 storage NO_storage GO_0035732 http://purl.obolibrary.org/obo/GO_0035732
21600591 constipation constipation DOID_2089 http://purl.obolibrary.org/obo/DOID_2089
21600591 neuritis neuritis DOID_1803 http://purl.obolibrary.org/obo/DOID_1803
21600591 parkinson_disease Parkinson's_disease DOID_14330 http://purl.obolibrary.org/obo/DOID_14330
21600591 sleep sleep GO_0030431 http://purl.obolibrary.org/obo/GO_0030431
21600591 starts starts_with GO_2001317 http://purl.obolibrary.org/obo/GO_2001317
21600591 toxin toxin CHEBI_27026 http://purl.obolibrary.org/obo/CHEBI_27026
21705020 cognition cognition GO_0050890 http://purl.obolibrary.org/obo/GO_0050890
21705020 dementia dementia DOID_1307 http://purl.obolibrary.org/obo/DOID_1307
21705020 parkinson_disease Parkinson's_disease DOID_14330 http://purl.obolibrary.org/obo/DOID_14330
Unicage NER is a simple and fast tool that can be used in the NER landscape as a complement to other existing software solutions or as a stand-alone alternative for a fast “diagnosis” of the text. Although we only present a biomedical use case, Unicage NER can be used in other domains as well.
However, Unicage NER is still a prototype under development, and it requires further testing and a more detailed evaluation. Nevertheless, even as a prototype, it shows that it is possible to develop simple NER tools using Unicage, and that the use of Unicage technology in NLP might be a way of improving the processing speeds of existing solutions.