Unicage NER

How Unicage can be used to design a simple Named Entity Recognition Solution

NER: A Quick Introduction

Text Mining is a process of knowledge discovery in which relevant information is extracted from unstructured text. Text mining presents itself as an interdisciplinary method for information retrieval and encompasses several Natural Language Processing (NLP) tasks. Among these NLP tasks we have Named Entity Recognition (NER).

NER has the goal of recognizing relevant entities in text, such as persons, locations, or organizations, but it can also be applied to specific domains of work. For example, in the biomedical domain, NER can be used to identify proteins, disease names, biological processes, drugs, and chemical compounds, among others. NER is essential for many other NLP applications, such as text summarization, question answering, and machine translation.

We can organize NER solutions into three categories:

Rule-Based

The Rule-Based approach is the simplest of the three. It uses a set of rules to identify the entities in the text. A lexicon containing the entities that we want to identify might also be used. Unicage NER is an example of this approach.

Artificial Intelligence

The Artificial Intelligence approaches are based on machine learning and deep learning techniques. They can learn to recognize entities in text and often achieve strong results. However, they usually require a corpus of labeled text data on which the models can be trained, and they are usually more computationally intensive.

Hybrid

Finally, the Hybrid approaches combine machine learning techniques with rule-based systems. In most cases, these approaches consist of a machine learning model that is later refined with additional linguistic rules to improve its accuracy.

What is Unicage NER?

Unicage NER is a rule-based system that can be used to identify key entities in text, as long as those entities are present in a lexicon file. This lexicon is created by the user according to their needs. It can consist of just two or three words, or it can be built from an entire dictionary or ontology.

Unicage NER was developed with the goal of being a fast and reliable tool to identify key biomedical entities in scientific text. However, it can be used in other domains as well. It was not made to replace current NER solutions, but rather to complement them. In addition, Unicage NER differentiates itself by processing large amounts of data quickly without the need for GPUs or heavy computational resources. This performance is possible because our solution was developed using mostly Unicage commands and shell scripting.

Use Case: Identifying Biomedical Entities

Let us take as a use case the identification of biomedical entities present in scientific text. We have a file composed of 110,000 scientific articles that were extracted from PubMed. This file is in JSON format and is approximately 227 MB in size.

With Unicage NER we shall do the following:

      • JSON file parsing and data normalization
      • Stop word removal and tokenization
      • Identification of biomedical entities
      • Entity Linking with additional information

During the parsing stage, the JSON file is parsed using Unicage’s rjson command, which converts the file to a field-separated format. This conversion allows the data to be normalized and makes further processing easier, without the need for specialized tools to read JSON data.
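To make the idea of this stage concrete, here is a minimal sketch in Python rather than in Unicage commands. It is not how rjson works internally. It only illustrates turning JSON records into one normalized, field-separated line per article. The field names (pmid, abstract), the tab separator, and the sample record are assumptions made for the illustration.

```python
import json

def json_to_rows(raw_json, fields=("pmid", "abstract")):
    """Flatten a JSON array of article objects into tab-separated lines.

    The field names and the tab separator are assumptions for this sketch.
    """
    rows = []
    for article in json.loads(raw_json):
        values = []
        for name in fields:
            value = str(article.get(name, ""))
            # Collapse tabs, newlines and repeated spaces so that each
            # record stays on a single line, then lowercase the value.
            value = " ".join(value.split())
            values.append(value.lower())
        rows.append("\t".join(values))
    return rows

# Hypothetical sample record (not taken from the PubMed file).
sample = '[{"pmid": "10000001", "abstract": "A neuroprotective agent\\nfor Parkinson disease."}]'
for row in json_to_rows(sample):
    print(row)
```

Once every article is a single line of plain fields, the later steps can work with ordinary line-oriented text tools.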

Then, we start the tokenization process. Tokenization is a crucial step in NLP and consists of breaking sequences of text into smaller fragments called tokens. These tokens can be words, sentences, or even whole paragraphs, depending on the splitting criteria chosen. In our case, each token corresponds to a word. After the tokenization process is completed, we remove the stop words that are present among the tokens.
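As an illustration of these two steps, the following Python sketch tokenizes a sentence into words and then drops stop words. The regular expression and the very short stop word list are assumptions chosen for the example. They are not the rules or the list used by Unicage NER.

```python
import re

# Tiny illustrative stop word list; a real list would be much longer.
STOP_WORDS = {"the", "of", "and", "in", "a", "is", "for", "with"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    # A token is a run of letters/digits, optionally joined by ' or -
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop word list."""
    return [t for t in tokens if t not in stop_words]

tokens = tokenize("Sleep disorders and constipation in Parkinson disease.")
print(remove_stop_words(tokens))
# ['sleep', 'disorders', 'constipation', 'parkinson', 'disease']
```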

The next step is the NER phase. As we explained before, Unicage NER is a rule-based approach that uses a lexicon file comprising the key entities that we want to find in our text. In this use case, our lexicon is composed of key entities from different biomedical ontologies, namely the Disease Ontology, ChEBI, Gene Ontology, and DrugBank. Through a combination of shell and Unicage commands, we are able to identify the entities from the lexicon that are present in the tokenized text.
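The matching idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the actual Unicage implementation: we assume lexicon entries are stored with underscores between words (as in the output shown in the Results section), and we assume a maximum entity length of three words. The function slides over the tokens, builds every candidate of one to three words, and keeps those found in the lexicon.

```python
def find_entities(tokens, lexicon, max_words=3):
    """Return the sorted, unique lexicon entries found in the token list."""
    found = set()
    for start in range(len(tokens)):
        # Try candidates of 1..max_words tokens starting at this position.
        longest = min(max_words, len(tokens) - start)
        for size in range(1, longest + 1):
            candidate = "_".join(tokens[start:start + size])
            if candidate in lexicon:
                found.add(candidate)
    return sorted(found)

# Small illustrative lexicon; a real one would come from the ontologies.
lexicon = {"parkinson_disease", "constipation", "sleep", "multiple_system_atrophy"}
tokens = ["sleep", "disorders", "constipation", "parkinson", "disease"]
print(find_entities(tokens, lexicon))
# ['constipation', 'parkinson_disease', 'sleep']
```

Keeping the lexicon in a set makes each lookup a constant-time operation, so the cost grows linearly with the number of tokens.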

Finally, we have the Entity Linking step. In this step, we attach additional information to the entities that we identified earlier. This additional information is also present in our lexicon. One example is a link to the ontology page where more detailed information about the entity can be found.
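A minimal Python sketch of this step is shown below. It treats the lexicon as a lookup table from the found entity to its ontology name and identifier, and builds the link from the identifier. The two lexicon entries and the link prefix are taken from the sample output in the Results section. The dictionary layout and the function name are assumptions made for this illustration.

```python
# found_entity -> (onto_entity, onto_ID); layout assumed for this sketch.
LEXICON = {
    "parkinson_disease": ("Parkinson's_disease", "DOID_14330"),
    "constipation": ("constipation", "DOID_2089"),
}
OBO_PREFIX = "http://purl.obolibrary.org/obo/"

def link_entities(pmid, entities, lexicon=LEXICON):
    """Return one (pmid, found, onto_entity, onto_ID, onto_link) row per entity."""
    rows = []
    for entity in entities:
        if entity in lexicon:
            onto_entity, onto_id = lexicon[entity]
            rows.append((pmid, entity, onto_entity, onto_id, OBO_PREFIX + onto_id))
    return rows

for row in link_entities("21600591", ["constipation", "parkinson_disease"]):
    print(" ".join(row))
```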

Results

In the following table, we present the time that we needed to run Unicage NER on a simple laptop using an Ubuntu virtual machine with 4 cores and 4 GB of RAM. As we can see, Unicage NER takes less than 1 minute to process 110,000 articles.

As for the output, two files are generated: one file contains the entities that were found in the articles, while the other contains the information from the Entity Linking.

File with Entities
pmid     entities
20438581 neuroprotective_agent parkinson_disease
21553114 multiple_system_atrophy parkinson_disease storage
21600591 constipation neuritis parkinson_disease sleep starts toxin
21705020 cognition dementia parkinson_disease pas
21707551 memory parkinson_disease solution
21735480 application complex parkinson_disease
21763451 binding dna mtdna_replication oxidative_phosphorylation oxphos parkinson_disease parkinsonism phosphorylation xenobiotic
21780180 parkinson_disease
21834616 parkinson_disease synucleinopathy
21849176 lewy_body parkinson_disease parkinsonism syndrome
21858430 complex kynurenine metabolism parkinson_disease tryptophan tryptophan_metabolism
21868278 behavior parkinson_disease participant
21887711 parkinson_disease
(...)

File with Entity Linking
pmid     found_entity            onto_entity             onto_ID      onto_link
20438581 neuroprotective_agent   neuroprotective_agent   CHEBI_63726  http://purl.obolibrary.org/obo/CHEBI_63726
20438581 parkinson_disease       Parkinson's_disease     DOID_14330   http://purl.obolibrary.org/obo/DOID_14330
21553114 multiple_system_atrophy Multiple_system_atrophy DOID_4752    http://purl.obolibrary.org/obo/DOID_4752
21553114 parkinson_disease       Parkinson's_disease     DOID_14330   http://purl.obolibrary.org/obo/DOID_14330
21553114 storage                 NO_storage              GO_0035732   http://purl.obolibrary.org/obo/GO_0035732
21600591 constipation            constipation            DOID_2089    http://purl.obolibrary.org/obo/DOID_2089
21600591 neuritis                neuritis                DOID_1803    http://purl.obolibrary.org/obo/DOID_1803
21600591 parkinson_disease       Parkinson's_disease     DOID_14330   http://purl.obolibrary.org/obo/DOID_14330
21600591 sleep                   sleep                   GO_0030431   http://purl.obolibrary.org/obo/GO_0030431
21600591 starts                  starts_with             GO_2001317   http://purl.obolibrary.org/obo/GO_2001317
21600591 toxin                   toxin                   CHEBI_27026  http://purl.obolibrary.org/obo/CHEBI_27026
21705020 cognition               cognition               GO_0050890   http://purl.obolibrary.org/obo/GO_0050890
21705020 dementia                dementia                DOID_1307    http://purl.obolibrary.org/obo/DOID_1307
21705020 parkinson_disease       Parkinson's_disease     DOID_14330   http://purl.obolibrary.org/obo/DOID_14330
(...)

Final Thoughts

Unicage NER is a simple and fast tool that can be used in the NER landscape as a complement to other existing software solutions or as a stand-alone alternative for a fast “diagnosis” of the text. Although we only present a biomedical use case, Unicage NER can be used in other domains as well.

However, Unicage NER is still a prototype under development, and it requires further testing and a more detailed evaluation. Nevertheless, even as a prototype, it shows that it is possible to develop simple NER tools using Unicage and that the use of Unicage technology in NLP might be a way of improving the processing speeds of existing solutions.
