From 3M Health Information Systems
Data Standards, Natural Language Processing, and Healthcare IT
With so many healthcare organizations evaluating applications that use natural language processing (NLP), I’m often asked if there is a specific standard that defines NLP best practice. Unstructured Information Management Architecture, or UIMA, is a technical platform that runs inside a computer process and serves to integrate a pipeline of software components, each of which executes a single NLP step (more on NLP processes and steps next time). The UIMA platform is used for NLP across many industries, not just Healthcare.
FIGURE ONE: UIMA Overview*
A widely used Healthcare-specific standard is the Unified Medical Language System, or UMLS, an important ontology, or vocabulary, relied on by open-source clinical NLP systems such as cTAKES (clinical Text Analysis and Knowledge Extraction System).
Particularly given the similarities between the acronyms, I am sometimes asked why we need NLP technologies such as UIMA when we already have UMLS. In fact, the two technologies go together: a good ontology is a vital foundation for NLP, but it is only part of the solution. For instance, a document may have an indication of laterality needed for an ICD-10 code, but to use it the NLP will need to identify the textual concepts to which that laterality is linguistically linked. In simple cases this may be trivial; in others the linked concept may be in another sentence, paragraph, or even a separate clinical document. The key benefit of NLP is going beyond keyword look-up to true linguistic processing and understanding of the text. The example at the end of this post shows how UIMA and UMLS cooperate and how both are integrated into a complete NLP engine such as cTAKES.
Let’s imagine a simplistic application that processes clinical text to populate a data warehouse of patients and the presence or absence of clinical conditions, expressed as UMLS CUIs (Concept Unique Identifiers), from which data mining can be performed. In our application, cTAKES provides the “annotators” – the pieces of software that perform individual NLP steps. UIMA is the “glue” that ties the annotators together in a pipeline, passing the document and annotation results from one step to another. UMLS is the “ontology” – the dictionary that defines the clinical concepts, each of which has an identifier called a CUI. In addition to providing unique identifiers for concepts, UMLS also has lists of synonyms and acronyms, maps to various code sets such as SNOMED CT, and identifies relationships between concepts.
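To make the “ontology as dictionary” idea concrete, here is a minimal Python sketch of a concept table that maps names and synonyms to a CUI. It is only an illustration, not how UMLS or cTAKES store their data: the `Concept` class and the `build_index` and `lookup` functions are hypothetical names, and the abbreviation “PNA” is an assumed synonym added for the example rather than an entry confirmed from UMLS. Only the CUI for pneumonia comes from this post.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Concept:
    """One ontology entry: a unique identifier plus the strings that name it."""
    cui: str
    preferred_name: str
    synonyms: tuple = ()  # assumed surface forms, not taken from UMLS


def build_index(concepts):
    """Map every lower-cased name or synonym to its concept."""
    index = {}
    for concept in concepts:
        for name in (concept.preferred_name, *concept.synonyms):
            index[name.lower()] = concept
    return index


# The CUI is the one quoted in this post; the synonym is illustrative only.
CONCEPTS = [
    Concept("C0032285", "pneumonia", synonyms=("PNA",)),
]
INDEX = build_index(CONCEPTS)


def lookup(term):
    """Return the CUI for a term, or None if the toy ontology does not know it."""
    concept = INDEX.get(term.lower())
    return concept.cui if concept else None


print(lookup("Pneumonia"))
print(lookup("fracture"))
```

The point of the index is that different surface strings resolve to the same identifier, so everything downstream can work with concepts rather than raw words.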
For example, given the sentence:
“The patient does not have pneumonia.”
Our simple application would have (at least) three annotators: (1) a tokenizer, (2) a UMLS look-up annotator (formally, a “named entity recognizer”), and (3) a negation detector. The tokenizer takes the sentence and outputs something like this (formatted for readability; the real output would be XML):
1: The
2: patient
3: does
4: not
5: have
6: pneumonia
7: .
The UMLS named-entity recognizer identifies that token 6, “pneumonia”, maps to CUI C0032285, and finally the negation detector identifies that the phrase “does not have” in tokens 3-5 negates the term at position 6, which we have already identified as pneumonia. We end up with:
6: pneumonia <CUI C0032285> <negated negation_start_token = 3 negation_end_token = 5>
And now we have the information we need to store in our data warehouse.
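The three steps can be sketched as a toy pipeline in Python. This is a simplified illustration under stated assumptions, not the cTAKES annotators or the UIMA API: the function names, the dictionary-based document, and the one-entry `TERM_TO_CUI` table are all hypothetical, and the negation rule recognizes only the exact cue “does not have” immediately before a concept. Real negation detectors handle many more cues and scopes.

```python
import re

# Toy ontology: the only entry is the CUI quoted in this post.
TERM_TO_CUI = {"pneumonia": "C0032285"}


def tokenizer(doc):
    """Step 1: split text into words and punctuation, numbered from 1."""
    words = re.findall(r"\w+|[^\w\s]", doc["text"])
    doc["tokens"] = [{"id": i, "text": w} for i, w in enumerate(words, start=1)]


def concept_lookup(doc):
    """Step 2: attach a CUI to every token found in the toy ontology."""
    for token in doc["tokens"]:
        cui = TERM_TO_CUI.get(token["text"].lower())
        if cui:
            token["cui"] = cui


def negation_detector(doc):
    """Step 3: mark a concept negated when 'does not have' directly precedes it."""
    cue = ["does", "not", "have"]
    tokens = doc["tokens"]
    for i, token in enumerate(tokens):
        if "cui" not in token or i < len(cue):
            continue
        window = tokens[i - len(cue):i]
        if [t["text"].lower() for t in window] == cue:
            token["negated"] = True
            token["negation_start_token"] = window[0]["id"]
            token["negation_end_token"] = window[-1]["id"]


def run_pipeline(text, annotators):
    """The 'glue': pass one shared document through each annotator in order."""
    doc = {"text": text}
    for annotate in annotators:
        annotate(doc)
    return doc


ANNOTATORS = [tokenizer, concept_lookup, negation_detector]
doc = run_pipeline("The patient does not have pneumonia.", ANNOTATORS)

# Keep only the tokens that were mapped to a concept.
findings = [t for t in doc["tokens"] if "cui" in t]
print(findings)
```

Each annotator reads what earlier steps wrote into the shared document and adds its own results, which is the role the pipeline framework plays. Swapping in a better negation detector would not require touching the other two steps.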
So cTAKES provided the three software annotators, UIMA provided the glue that enabled those annotators to work together as a single pipeline, and UMLS provided the ontology and the mapping from specific terms into the ontology.
Richard Wolniewicz is a Division Scientist, Natural Language Processing at 3M Health Information Systems.
FIGURE ONE: UIMA Overview*: “UIMA Overview & SDK Setup.” Apache UIMA – Apache UIMA. July 2007. <http://uima.apache.org/downloads/releaseDocs/2.2.0-incubating/docs/html/overview_and_setup/overview_and_setup.html>.