We are proud to announce that team LasigeUnicage placed 7th (out of 232 participants and 455 submissions) and was awarded a cash prize of 5,000 USD in the LitCoin international challenge, a competition that is part of the NASA Tournament Lab and hosted by NCATS (the National Center for Advancing Translational Sciences), with contributions from the National Library of Medicine (NLM). In collaboration with Bitgrit and CrowdPlat, these institutions set up the challenge with the goal of deploying data-driven technology solutions to accelerate scientific research in medicine and to ensure that data from biomedical publications can be maximally leveraged and reach a wide range of biomedical researchers.
For this challenge, Unicage joined forces with a team of top researchers from LASIGE, namely the DeST (Deep Semantic Tagger) project team, which focuses on and has expertise in several Natural Language Processing (NLP) tasks applied to the biomedical domain.
Our goal was to develop and integrate Unicage technology within a state-of-the-art biomedical NLP pipeline. To achieve this, the LASIGE team members focused on developing and training the deep learning models used in the different parts of the competition, whilst the Unicage team members focused on developing a solution to normalize, process and filter the different biomedical datasets in a simple and efficient way so that they could be used by the deep learning models.
The LitCoin challenge was divided into two parts. We shall briefly explain the approach of the LasigeUnicage team for each of them in the following sections.
Part 1 – Named Entity Recognition
For the first part of the competition, the goal was to identify all biomedical entities present in an abstract, given only its text. As a first step, we used the Unicage Open Version commands to convert and merge several biomedical datasets from different formats (csv, tsv, PubTator, IOB, Standoff) into the CoNLL BIO format, creating one dataset for each entity category (e.g. Chemicals, Diseases, Species). Because all these formats are text based, we found that the Unicage commands could process and manipulate the datasets efficiently according to our needs. We also noted that the resulting scripts were less complex than those of other approaches, such as Python, which required several distinct libraries and methods to handle the different dataset formats.
The next step was training a deep learning model using these processed datasets as training sets. We used embeddings from the PubMedBERT model, originally trained on PubMed articles, together with a linear classification layer that classifies each token, and fine-tuned the model on each training dataset. This produced six distinct models, one for each category, which were ensembled before making the predictions.
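To make the target format concrete, here is a minimal sketch of the BIO tagging scheme, written in Python purely for illustration. It is not the Unicage script used in the competition. It assumes whitespace tokenization and entity annotations given as character offsets with an exclusive end, and the function names, example sentence and labels are all hypothetical.

```python
import re


def spans_to_bio(text, entities):
    """Convert character-offset annotations into (token, BIO tag) pairs.

    entities: list of (start, end, label) tuples, with end exclusive.
    Tokenization is plain whitespace splitting (an assumption).
    """
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"  # default: token is outside any entity
        for start, end, label in entities:
            # A token belongs to an entity if the character ranges overlap.
            if match.start() < end and match.end() > start:
                # The first token of an entity gets B-, later tokens get I-.
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


def to_conll(rows):
    """Render one token and its tag per line, separated by a tab."""
    return "\n".join(f"{token}\t{tag}" for token, tag in rows)


# Hypothetical example sentence with two annotated entities.
text = "Aspirin reduces the risk of heart attack in adults."
entities = [(0, 7, "Chemical"), (28, 40, "Disease")]
print(to_conll(spans_to_bio(text, entities)))
```

In this scheme the first token of an entity is tagged `B-<label>`, the following tokens of the same entity `I-<label>`, and every other token `O`.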
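The post does not detail how the per-category models were combined. One simple possibility, shown here only as an assumption, is to merge the BIO tag sequences that each model produces for the same tokens, resolving conflicts with a fixed priority order. The function name and the priority rule are hypothetical.

```python
def merge_predictions(per_model_tags):
    """Merge per-category BIO tag sequences into a single sequence.

    per_model_tags: list of tag lists, one per model, all aligned to the
    same tokens and ordered by priority (earlier models win conflicts).
    The priority rule is an assumption, not the team's documented method.
    """
    lengths = {len(tags) for tags in per_model_tags}
    if len(lengths) != 1:
        raise ValueError("all models must tag the same tokens")
    merged = []
    for token_tags in zip(*per_model_tags):
        # Keep the first non-"O" tag; fall back to "O" if no model fires.
        merged.append(next((t for t in token_tags if t != "O"), "O"))
    return merged


# Hypothetical outputs of two category models over four tokens.
chemical_tags = ["B-Chemical", "O", "O", "O"]
disease_tags = ["O", "O", "B-Disease", "I-Disease"]
print(merge_predictions([chemical_tags, disease_tags]))
```

A token-level merge like this can split an entity when two models disagree on overlapping spans, so a real pipeline would likely need span-level conflict handling.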
Part 2 – Relation Extraction
For the second part of the competition, the goal was, given an abstract and the entities annotated in it, to identify all relationships between the annotated entities and to determine whether each relationship was novel. To achieve this, we first used Unicage to process the dataset provided by the competition.
Then, this processed dataset was used by the BiOnt system, a biomedical Relation Extraction system based on bidirectional LSTM networks. The BiOnt system incorporates Word2Vec word embeddings and makes use of different combinations of input channels, including ontology embeddings, to maximize performance. Two models were trained: one to predict the type of each relation and the other to predict whether it was Novel or not.
In the final step, we processed the output generated by BiOnt using Unicage commands and shell scripting. These scripts used a small set of rules to choose which No/Novel tag to keep for each relation identified by the BiOnt system and, at the same time, generated the final files in the format required by the competition.
Final Thoughts and Classification
In the end, after the best-scoring systems were analyzed, the competition organizers announced the winners: https://ncats.nih.gov/funding/challenges/litcoin/winners. The LasigeUnicage team reached 7th position worldwide and 2nd position among the European competitors, behind a team from Maastricht University, Netherlands, which achieved a better ranking than ours.
With this challenge, we were able to successfully integrate Unicage commands within a state-of-the-art biomedical NLP pipeline, together with a research unit of excellence from the University of Lisbon. Not only were we able to achieve a top ranking in a worldwide competition, but we also contributed to improving scientific research in the biomedical field. We were also able to do this using the Unicage Open Version, which allows our solution to be easily replicated not only by the competition organizers, but also by researchers who need to process vast amounts of biomedical text data.
You can learn more about the Unicage Open Version by clicking the button below.